[ANN] ocaml-git 2.0

mirageos
announce
git

#1

I’m very happy to announce a new major release of ocaml-git (2.0). This release is the result of a 2-year effort to build a revamped streaming API offering full control over memory allocation. This new version also adds production-ready implementations of the wire protocol: git push and git pull now work very reliably using the raw Git and smart HTTP protocols (SSH support will come soon). git gc is also implemented, and all of the basic bricks are now available to create Git servers. MirageOS support is available out-of-the-box.

Two years ago, we decided to rewrite ocaml-git and split it into standalone libraries. More details about these new libraries are also given below.

But first, let’s focus on ocaml-git's new design. The primary goal was to fix memory consumption issues that our users noticed with the previous version, and to make git push work reliably. We also took care not to break the API too much, to ease the transition for current users.

Controlled allocations

There is a big difference in the way ocaml-git and git are designed: git is a short-lived command-line tool which does not care that much about allocation policies, whereas we wanted to build a library that can be linked with long-lived Git client and/or server applications. We had to make some (performance) compromises to support that use-case, in exchange for tighter allocation policies and hence more predictable memory consumption patterns. Other Git libraries such as libgit2 also have to deal with similar concerns.

In order to keep tight control over the allocated memory, we decided to use decompress instead of camlzip. decompress allows users to provide their own buffers instead of allocating dynamically, which gives us better control over memory consumption. See below for more details on decompress.
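As a rough illustration of the caller-provided buffer idea (this is not decompress's API; `fold_chunks` and `count_newlines` are hypothetical names), the sketch below processes an input of any size through a single fixed buffer owned by the caller:

```ocaml
(* Hypothetical sketch: the caller owns [buf]; the loop reuses it for every
   chunk instead of allocating a new buffer per chunk. *)
let fold_chunks ~(buf : Bytes.t) (src : string) ~init ~f =
  if Bytes.length buf = 0 then invalid_arg "fold_chunks: empty buffer";
  let len = String.length src in
  let rec go pos acc =
    if pos >= len then acc
    else begin
      (* Copy at most one buffer's worth of input, then hand it to [f]. *)
      let n = min (Bytes.length buf) (len - pos) in
      Bytes.blit_string src pos buf 0 n;
      go (pos + n) (f acc buf n)
    end
  in
  go 0 init

(* Example consumer: count newlines, looking only at the first [n] bytes. *)
let count_newlines ~buf src =
  fold_chunks ~buf src ~init:0 ~f:(fun acc b n ->
    let c = ref acc in
    for i = 0 to n - 1 do
      if Bytes.get b i = '\n' then incr c
    done;
    !c)
```

With this shape, the memory used for buffering is decided once by the caller (here, the size of `buf`) rather than by the size of the input.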

We also used angstrom and encore to provide a streaming interface to encode and decode Git objects. The streaming API is currently hidden from the end-user, but it helped us a lot to build abstractions and, again, to manage the allocation policy of the library.

Complete PACK file support (including GC)

In order to find the right abstraction for manipulating pack files in a long-lived application, we experimented with various prototypes. We haven’t found the right abstractions just yet, but we believe the PACK format could be useful to store any kind of data in the future (and not only Git objects).

We implemented git gc by following the same heuristics as Git to compress pack files, and we produce something similar in size: decompress achieves a good compression ratio, and we use duff, our own implementation of xdiff, the binary diff algorithm used by Git (more details on duff below). We also had to re-implement the streaming algorithm to reconstruct idx files on the fly when receiving a pack file over the network.

One notable feature of our compression algorithms is that they work without the assumption that the underlying system implements POSIX. Hence, they can work fully in-memory, in a browser using web storage, or inside a MirageOS unikernel with wodan.

Production-ready push and pull

We re-implemented and abstracted the Git Smart protocol, and used that abstraction to make git push and git pull work over HTTP. By default we provide a cohttp implementation, but users can use their own (for instance, one based on httpaf). As a proof of concept, the initial pull-request was created using this new implementation; moreover, we wrote a prototype of a Git client compiled with js_of_ocaml, which was able to run git pull over HTTP inside a browser!

Finally, that implementation will allow MirageOS unikernels to synchronize their internal state with external Git stores (hosted for instance on GitHub) using push/pull mechanisms. We also expect to release a server-side implementation of the smart HTTP protocol, so that the state of any unikernel can be inspected via git pull. Stay tuned for more updates on that topic!
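For readers unfamiliar with the Smart protocol mentioned above: its messages are framed as "pkt-lines", where each line is prefixed by its total length (prefix included) written as four hexadecimal digits, and "0000" marks a flush. The following is a minimal sketch of that framing only; `pkt_line` and `flush_pkt` are illustrative names, not ocaml-git's API.

```ocaml
(* Frame a payload as a Git pkt-line: 4 hex digits giving the length of the
   whole line (the 4-byte prefix included), followed by the payload. *)
let pkt_line (payload : string) : string =
  let len = String.length payload + 4 in
  (* Git's protocol documentation caps a pkt-line at 65520 bytes. *)
  if len > 65520 then invalid_arg "pkt_line: payload too long";
  Printf.sprintf "%04x%s" len payload

(* The special flush packet, which carries no payload. *)
let flush_pkt = "0000"
```

For example, `pkt_line "a\n"` gives `"0006a\n"` and `pkt_line "foobar\n"` gives `"000bfoobar\n"`.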

Standalone Dependencies

Below you can find the details of the new stable releases of libraries that are used by ocaml-git 2.0.

optint and checkseum

In some parts of ocaml-git, we need to compute a Cyclic Redundancy Check (CRC) value, which is a 32-bit integer. optint provides an abstraction for it that is represented as an unboxed integer on 64-bit architectures and as a boxed int32 on 32-bit architectures.
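To make the boxed/unboxed distinction concrete, here is a small sketch (not optint's actual code; the function names are hypothetical, and it assumes a 64-bit target where native ints are 63 bits wide). It performs the same 32-bit wrapping addition once on a native int and once on a boxed int32:

```ocaml
(* Assumes a 64-bit target: the literal below does not fit in a 31-bit int. *)
let mask32 x = x land 0xFFFF_FFFF

(* 32-bit wrapping addition on an unboxed native int. *)
let add32_unboxed a b = mask32 (a + b)

(* The same operation through a boxed int32, converted back to an int so
   that both results can be compared. *)
let add32_boxed a b =
  let r = Int32.add (Int32.of_int a) (Int32.of_int b) in
  mask32 (Int32.to_int r)
```

Both functions agree on every 32-bit input; the unboxed version simply avoids allocating an Int32.t for each intermediate result.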

checkseum relies on optint and provides 3 checksum implementations:

  • Adler32 (used by zlib format)
  • CRC32 (used by gzip format and git)
  • CRC32-C (used by wodan)
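To give an idea of what such a checksum looks like, here is a plain OCaml sketch of Adler-32 as specified in RFC 1950. It is not checkseum's code, and it returns a native int rather than an optint value for simplicity (so it assumes a 64-bit target).

```ocaml
(* Adler-32 (RFC 1950): two running sums modulo 65521, packed as b:a. *)
let adler32 (s : string) : int =
  let base = 65521 in
  let a = ref 1 and b = ref 0 in
  String.iter
    (fun c ->
      a := (!a + Char.code c) mod base;
      b := (!b + !a) mod base)
    s;
  (!b lsl 16) lor !a
```

For instance, `adler32 "Wikipedia"` evaluates to `0x11E60398`, and the checksum of the empty string is `1`.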

checkseum uses the linking trick: this means that users of the library program against an abstract API (only the cmi is provided); at link-time, users have to select which implementation to use: checkseum.c (the C implementation) or checkseum.ocaml (the OCaml implementation). The process is currently a bit cumbersome, but an upcoming dune release will make it much more transparent to users.

encore (/angkor/)

In git, we work with Git objects (tree, blob, commit or tag). These objects are encoded in a specific format. Then, the hash of each object is computed from the encoded result to get a unique identifier. For example, the hash of your last commit is: sha1(encode(commit)).

A common operation in git is to decode Git objects from an encoded representation of them (especially in .git/objects/* as a loose file) and restore them in another part of your Git repository (like in a PACK file or on the command-line).

Hence, we need to ensure that encoding is always deterministic, and that decoding an encoded Git object is always the identity, i.e. there is an isomorphism between the decoder and the encoder.

let decoder <.> encoder : value -> value = id
let encoder <.> decoder : string -> string = id

encore is a library in which you can describe a format (like the Git format) and, from that description, derive a streaming decoder and encoder that are isomorphic by construction.
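To show what the round-trip property means in practice, here is a hand-written (non-streaming) sketch for the loose blob format, "blob <size>\000<contents>". It does not use encore; `value`, `encoder` and `decoder` are illustrative definitions, and the point is only that the two functions must be kept in sync by hand, which is what encore avoids.

```ocaml
type value = Blob of string

(* Encode a blob as Git does for loose objects: header, NUL byte, contents. *)
let encoder (Blob contents) : string =
  "blob " ^ string_of_int (String.length contents) ^ "\000" ^ contents

(* Decode, accepting only the canonical form so that re-encoding a decoded
   string gives back exactly the same bytes. *)
let decoder (s : string) : value option =
  match String.index_opt s '\000' with
  | None -> None
  | Some nul ->
    let header = String.sub s 0 nul in
    let body = String.sub s (nul + 1) (String.length s - nul - 1) in
    (match String.split_on_char ' ' header with
     | [ "blob"; len ] when len = string_of_int (String.length body) ->
       Some (Blob body)
     | _ -> None)
```

Note the strict check on the size field: a more lenient decoder (accepting "06" for 6, say) would still decode correctly but would break the second identity, since re-encoding would produce different bytes.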

duff

duff is a pure OCaml implementation of the xdiff algorithm. Git has an optimized representation of your Git repository: the PACK file. This format uses a binary diff algorithm called xdiff to compress binary data. xdiff takes a source A and a target B and tries to find common sub-strings between A and B.

This is done by computing a Rabin fingerprint of the source A and applying it to the target B. The fingerprint can then be used to produce a lightweight representation of B in terms of sub-strings of A.
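The "lightweight representation" can be pictured as a list of instructions that either copy a range of A or insert literal bytes. The sketch below shows only how such a list is applied to rebuild B (the diffing side, with the fingerprint, is omitted); the `hunk` type and `patch` function are illustrative and are not duff's API.

```ocaml
type hunk =
  | Copy of int * int  (* offset and length of a sub-string of the source *)
  | Insert of string   (* literal bytes that are absent from the source *)

(* Rebuild the target from the source and a list of hunks. *)
let patch ~(source : string) (delta : hunk list) : string =
  let buf = Buffer.create (String.length source) in
  List.iter
    (function
      | Copy (off, len) -> Buffer.add_substring buf source off len
      | Insert s -> Buffer.add_string buf s)
    delta;
  Buffer.contents buf
```

For example, with the source "the quick brown fox", the delta `[Copy (0, 4); Insert "slow"; Copy (9, 10)]` rebuilds "the slow brown fox" while storing only four literal bytes.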

duff implements this algorithm in OCaml (with Git’s additional constraints regarding the size of the sliding windows). It provides a small binary, xduff, that complies with the format of Git without the zlib layer.

$ xduff diff source target > target.xduff
$ xduff patch source < target.xduff > target.new
$ diff target target.new
$ echo $?
0

decompress

decompress is a pure OCaml implementation of zlib (RFC 1950) and DEFLATE (RFC 1951). You can compress and decompress data flows with it; Git uses this compression for loose files and PACK files.

It provides a non-blocking interface and is easily usable in a server context. Indeed, the implementation never allocates and only relies on what the user provides (window, input and output buffer). The distribution also provides a simple example of how to use decompress:

val deflate: ?level:int -> string -> string
val inflate: string -> string

digestif

digestif is a toolbox providing many implementations of hash algorithms such as:

  • MD5
  • SHA1
  • SHA224
  • SHA256
  • SHA384
  • SHA512
  • BLAKE2B
  • BLAKE2S
  • RIPEMD160

Like checkseum, digestif uses the linking trick too: from a shared interface, it provides 2 implementations, in C (digestif.c) and OCaml (digestif.ocaml).

Regarding Git, we use the SHA1 implementation and we are ready to migrate ocaml-git to BLAKE2{B,S} as the Git core team expects; in the OCaml world, this is just a functor application with another implementation.
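The "just a functor application" remark can be sketched as follows. Everything here is hypothetical and only meant to show the shape of the idea: `HASH`, `Make_store` and `Toy` are made-up names, `Md5` wraps the standard library's Digest module as a stand-in, and neither digestif's nor ocaml-git's real interfaces are shown.

```ocaml
module type HASH = sig
  val name : string
  val digest_string : string -> string (* raw digest bytes *)
end

(* Stand-in implementation backed by the standard library (MD5). *)
module Md5 : HASH = struct
  let name = "md5"
  let digest_string s = Digest.string s
end

(* A deliberately silly 1-byte hash, only to have a second implementation. *)
module Toy : HASH = struct
  let name = "toy-xor8"
  let digest_string s =
    let x = ref 0 in
    String.iter (fun c -> x := !x lxor Char.code c) s;
    String.make 1 (Char.chr !x)
end

(* Code written against HASH works unchanged with any implementation. *)
module Make_store (H : HASH) = struct
  let object_id (encoded : string) : string =
    let raw = H.digest_string encoded in
    let hex = Buffer.create (2 * String.length raw) in
    String.iter
      (fun c -> Buffer.add_string hex (Printf.sprintf "%02x" (Char.code c)))
      raw;
    H.name ^ ":" ^ Buffer.contents hex
end

module Store_md5 = Make_store (Md5)
module Store_toy = Make_store (Toy)
```

Switching hash functions then amounts to changing the argument passed to `Make_store`; none of the code inside the functor needs to change.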

eqaf

Some applications require that secret values are compared in constant time. Functions like String.equal do not have this property, so we have decided to provide a small package — eqaf — providing a constant-time equal function. digestif uses it to check equality of hashes — it also exposes unsafe_compare if you don’t care about timing attacks in your application.
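As an illustration of the general technique (a sketch, not necessarily eqaf's exact code; `ct_equal` is a made-up name), a comparison that aims at constant time visits every byte and accumulates the differences instead of returning at the first mismatch:

```ocaml
(* Compare two strings without an early exit on the first differing byte.
   Only the lengths, not the contents, influence the number of iterations. *)
let ct_equal (a : string) (b : string) : bool =
  let la = String.length a and lb = String.length b in
  if la <> lb then false
  else begin
    let acc = ref 0 in
    for i = 0 to la - 1 do
      (* Any differing byte sets at least one bit in the accumulator. *)
      acc := !acc lor (Char.code a.[i] lxor Char.code b.[i])
    done;
    !acc = 0
  end
```

Whether such code really runs in constant time depends on the compiler and the platform, which is why it has to be measured rather than assumed.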

Of course, most of the work on this package went not into implementing the equal function but into finding a way to check that it really runs in constant time. Using this check, we ran benchmarks on Linux, Windows and Mac.

An interesting fact is that after various experiments, we replaced the initial implementation in C (extracted from OpenBSD’s timingsafe_memcmp) with an OCaml implementation behaving in a much more predictable way on all the tested platforms.

Conclusion

The upcoming version 2.0 of Irmin is using ocaml-git to create small applications that push and pull their state to GitHub. We think that Git offers a very nice model to persist data for distributed applications and we hope that more people will use ocaml-git to experiment and manipulate application data in Git. Please send us your feedback!


#2

Not only an impressive upgrade to ocaml-git, but an excellent write-up too


#3

When is it useful to use ocaml-git over standard git? Why is it called a long-lived git client?


#4

The usual git is a set of command line tools for use by humans. ocaml-git is a set of libraries for building software in OCaml that manipulates and manages git repositories. If you’re trying to write software in OCaml that manipulates a git repo at more than a trivial level, or that even needs intimate knowledge of git internals, this is your library of choice.


#5

Thanks for the answer, do you also know of a practical use case of such a program?


#6

ocaml-git was initially developed for irmin. The idea behind it is to provide a persistent store for a unikernel/MirageOS. This imposes some constraints:

  • it must be a library, so that it can be statically linked with the rest of the OS
  • it must be written in OCaml

ocaml-git provides some binaries (like ogit-write-tree, etc.), but they are only small examples of how to use the library. The goal is definitely not (at this stage) to provide a new CLI tool. As a library, ocaml-git is meant to be used in a server context, to be synchronized with different endpoints (GitHub, MirageOS, local Git repository, etc.).

Finally, a practical example can be built with irmin when you want to access the store of your unikernel without having any direct access to the unikernel itself (via SSH, for example). You just need to push to a Git repository which is synchronized with your unikernel, and your unikernel will then (safely) load what you pushed.

A good example is Canopy, which is a static blog (unikernel). When you want to add a new article, instead of rebuilding your unikernel, you tell it to synchronize with a GitHub repository which contains your articles; a real-world example is the blog of @hannesm. Of course, we have other examples (like a DNS server synchronized with a zonefile available in a Git repository).


#7

I am using a pre-2.0 version of ocaml-git. Will that version still be supported? What is the path to migrate to version 2.0?


#8

Interesting, I am interested in using Irmin as a DB for a static web site, that’s why I want to find out more about ocaml-git. I will have a look at Canopy as it is a static blog generator using Irmin, very relevant to my use case. For other scenarios it still seems a bit foreign to me, but I am sure that I will get it after a while :grinning:


#9

How complete is it? Can it determine the status of a working directory, which branch is checked out, how many files are added, deleted and unknown or is it more like an implementation of the data structures used within git?

In particular I have this shell prompt script which is very slow, and I would like to replace it without shelling out to a number of git commands and without also implementing half of git itself.


#10

We have full coverage for the on-disk and wire protocol, so reading the current branch is easy.
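To illustrate what this means at the on-disk level (this is a plain sketch and not ocaml-git's API; `branch_of_head` is a made-up helper), the current branch is recorded in the .git/HEAD file as a symbolic reference such as "ref: refs/heads/master":

```ocaml
(* Extract the branch name from the contents of a .git/HEAD file.
   Returns None for a detached HEAD or any unexpected contents. *)
let branch_of_head (head : string) : string option =
  let prefix = "ref: refs/heads/" in
  let head = String.trim head in
  let plen = String.length prefix in
  if String.length head > plen && String.sub head 0 plen = prefix
  then Some (String.sub head plen (String.length head - plen))
  else None
```

For example, `branch_of_head "ref: refs/heads/master\n"` returns `Some "master"`, while a HEAD file containing a bare commit hash yields `None`.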

We also have support for the working tree, but this is not as well tested (and documented…) as the rest of the code, so you might hit performance and usability issues. We would be happy to support that better, so please feel free to report any issues!


#11

If you use only the store API, it hasn’t changed much: http://mirage.github.io/ocaml-git/git/Git/module-type-S/index.html (the main difference is that you won’t need to use Git.* functions and/or modules once you have an implementation at hand, as it now includes everything that you need).

If you were using other parts of the API, the changes are bigger, as the streaming API is just different from the previous one; we would be very happy to help you migrate if you have any questions. And we could set up a FAQ with the questions/answers related to migration.


#12

The link to eqaf is broken: “gitnub” -> “github”.

Big thanks for the write-up and all of these nifty libraries! I’ll try to use decompress instead of camlzip in one of my projects, as a measure against DLL hell.

Some benchmark results would be a good addition to the README.


#13

Thanks, I’ve fixed the broken link in the blog post too :slight_smile: