I’m very happy to announce a new major release of ocaml-git
(2.0). This release is a 2-year effort to get a revamped streaming API offering a full control over memory allocation. This new version also adds production-ready implementations of the wire protocol: git push
and git pull
now work very reliably using the raw Git and smart HTTP protocol (SSH support will come soon). git gc
is also implemented, and all of the basic bricks are now available to create Git servers. MirageOS support is available out-of-the-box.
Two years ago, we decided to rewrite ocaml-git
and split it into standalone libraries. More details about these new libraries are also given below.
But first, let’s focus on ocaml-git
's new design. The primary goal was to fix memory consumption issues that our users noticed with the previous version, and to make git push
work reliably. We also took care about not breaking the API too much, to ease the transition for current users.
Controlled allocations
There is a big difference in the way ocaml-git
and git
are designed: git
is a short-lived command-line tool which does not care that much about allocation policies, whereas we wanted to build a library that can be linked with long-lived Git client and/or server applications. We had to make some (performance) compromises to support that use-case, at the benefit of tighter allocation policies — and hence more predictable memory consumption patterns. Other Git libraries such as libgit2 also have to deal with similar concerns.
In order to keep a tight control on the allocated memory, we decided to use decompress instead of camlzip
. decompress
allows the users to provide their own buffer instead of allocating dynamically. This allowed us to keep a better control on memory consumption. See below for more details on decompress
.
We also used angstrom and encore to provide a streaming interface to encode and decode Git objects. The streaming API is currently hidden to the end-user, but it helped us a lot to build abstraction and, again, on managing the allocation policy of the library.
Complete PACK file support (including GC)
In order to find the right abstraction for manipulating pack files in a long-lived application, we experimented with various prototypes. We haven’t found the right abstractions just yet, but we believe the PACK format could be useful to store any kind of data in the future (and not especially Git objects).
We implemented git gc
by following the same heuristics as Git to compress pack files and we produce something similar in size — decompress
has a good ratio about compression — and we are using duff
, our own implementation of xdiff
, the binary diff algorithm used by Git (more details on duff
below). We also had to re-implement the streaming algorithm to reconstruct idx
files on the fly, when receiving pack file on the network.
One notable feature of our compression algorithms is they work without the assumption that the underlying system implements POSIX: hence, they can work fully in-memory, in a browser using web storage or inside a MirageOS unikernel with wodan.
Production-ready push and pull
We re-implemented and abstracted the Git Smart protocol, and used that abstraction to make git push
and git pull
work over HTTP. By default we provide a cohttp implementation but users can use their own — for instance based on httpaf
. As proof-of-concept, the initial pull-request was created using this new implementation; moreover, we wrote a prototype of a Git client compiled with js_of_ocaml
, which were able to run git pull
over HTTP inside a browser!
Finally, that implementation will allow MirageOS unikernels to synchronize their internal state with external Git stores (hosted for instance on GitHub) using push/pull mechanisms. We also expect to release a server-side implementation of the smart HTTP protocol, so that the state of any unikernel can be inspected via git pull
. Stay tuned for more updates on that topic!
Standalone Dependencies
Below you can find the details of the new stable releases of libraries that are used by ocaml-git
2.0.
optint
and checkseum
In some parts of ocaml-git
, we need to compute a Circular Redundancy Check value. It is 32-bit integer value. optint
provides an abstraction of it but structurally uses an unboxed integer or a boxed int32
value depending on target (32 bit or 64 bit architecture).
checkseum
relies on optint
and provides 3 implementations of CRC:
- Adler32 (used by
zlib
format) - CRC32 (used by
gzip
format andgit
) - CRC32-C (used by
wodan
)
checkseum
uses the linking trick: this means that users of the library program against an abstract API (only the cmi
is provided); at link-time, users have to select which implementation to use: checkseum.c
(the C implementation) or checkseum.ocaml
(the OCaml implementation). The process is currently a bit cumbersome but upcoming dune
release will make that process much more transparent to the users.
encore
(/angkor/)
In git
, we work with Git objects (tree, blob or commit). These objects are encoded in a specific format. Then, the hash of these objects are computed from the encoded result to get a unique identifier. For example, the hash of your last commit is: sha1(encode(commit))
.
A common operation in git
is to decode Git objects from an encoded representation of them (especially in .git/objects/*
as a loose file) and restore them in another part of your Git repository (like in a PACK file or on the command-line).
Hence, we need to ensure that encoding is always deterministic, and that decoding an encoded Git object is always the identity, e.g. there is an isomorphism between the decoder and the encoder.
let decoder <.> encoder : value -> value = id
let encoder <.> decoder : string -> string = id
encore is a library in which you can describe a format (like Git format) and from it, we can derive a streaming decoder and encoder that are isomorphic by construction.
duff
duff
is a pure implementation in OCaml of the xdiff
algorithm. Git has an optimized representation of your Git repository. It’s a PACK file. This format uses a binary diff algorithm called xdiff
to compress binary data. xdiff
takes a source A and a target B and try to find common sub-strings between A and B.
This is done by a Rabin’s fingerprint of the source A applied to the target B. The fingerprint can then be used to produce a lightweight representation of B in terms of sub-strings of A.
duff
implements this algorithm (with additional Git’s constraints, regarding the size of the sliding windows) in OCaml. It provides a small binary xduff
that complies with the format of Git without the zlib
layer.
$ xduff diff source target > target.xduff
$ xduff patch source < target.xduff > target.new
$ diff target target.new
$ echo $?
0
decompress
decompress
is a pure implementation in OCaml of zlib
and rfc1951
. You can compress and decompress data flows and, obviously, Git does this compression in loose files and PACK files.
It provides a non-blocking interface and is easily usable in a server context. Indeed, the implementation never allocates and only relies on what the user provides (window
, input and output buffer). Then, the distribution provides an easy example of how to use decompress
:
val inflate: ?level:int -> string -> string
val deflate: string -> string
digestif
digestif
is a toolbox providing many implementations of hash algorithms such as:
- MD5
- SHA1
- SHA224
- SHA256
- SHA384
- SHA512
- BLAKE2B
- BLAKE2S
- RIPEMD160
Like checkseum
, digestif
uses the linking trick too: from a shared interface, it provides 2 implementations, in C (digestif.c
) and OCaml (digestif.ocaml
).
Regarding Git, we use the SHA1 implementation and we are ready to migrate ocaml-git
to BLAKE2{B,S} as the Git core team expects - and, in the OCaml world, it is just a functor application with another implementation.
eqaf
Some applications require that secret values are compared in constant time. Functions like String.equal
do not have this property, so we have decided to provide a small package — eqaf — providing a constant-time equal
function. digestif
uses it to check equality of hashes — it also exposes unsafe_compare
if you don’t care about timing attacks in your application.
Of course, the biggest work on this package is not about the implementation of the equal
function but a way to check the constant-time assumption on this function. Using this, we did a benchmark on Linux, Windows and Mac to check it.
An interesting fact is that after various experiments, we replaced the initial implementation in C (extracted from OpenBSD’s timingsafe_memcmp) with an OCaml implementation behaving in a much more predictable way on all the tested platforms.
Conclusion
The upcoming version 2.0 of Irmin is using ocaml-git to create small applications that push and pull their state to GitHub. We think that Git offers a very nice model to persist data for distributed applications and we hope that more people will use ocaml-git to experiment and manipulate application data in Git. Please send us your feedback!