I’m delighted to announce the release of Carton 1.0.0 and Cachet (which will be released soon into opam-repository
).
Carton is a reimplementation of the Git PACK format. A PACK file is what you can find in your .git/objects/pack
in your favourite Git repository. It contains mainly all your Git objects. This format provides a good compression ratio and the ability to extract objects almost directly. It can be seen as a read-only key-value database — in effect, modifying Git objects is impossible.
This project is built around the OCaml implementation of Git that we have. But the PACK format is also interesting in its own right and outside the Git concepts.
The PACK format offers double compression. A zlib compression (proposed by decompress) as well as a compression between objects in the form of a binary patch (proposed by duff).
So, if the “words” appear quite frequently (like the words used in a programming language — if, else, then, etc.), the second level of compression becomes very interesting where an object (such as a file) is simply a succession of patches with other objects.
Cachet, a library for mmap
syscall
Carton and the PACK format very often use syscall mmap
. The point is to be able to take advantage of the kernel cache system to read a PACK file. The kernel can read a file in advance when reading a page via mmap
. Basically, the kernel anticipates that you might want to get the next page after the one you requested.
However, in the case of Carton, it is sometimes necessary to ‘go back’, particularly for patched objects whose source is often upstream.
Cachet is an intermediate layer for mmap
that caches previously obtained pages. In this way, we take advantage of both the kernel for subsequent pages and our library for previous pages.
Let’s take a concrete example. Carton can analyse a PACK file as git verify-pack
does. Let’s make a comparison with and without Cachet.
+--------------+-------------+----------------+-----------------+
| | with cachet | without cachet | git verify-pack |
+--------------+-------------+----------------+-----------------+
| time | 17.8s | 41.8s | 9.3s |
+--------------+-------------+----------------+-----------------+
| cache misses | 936M | 1933M | 246M |
+--------------+-------------+----------------+-----------------+
As you can see, using Cachet improves Carton’s execution time. We’re still not as competitive as git-verify-pack, but we’re getting close!
Cachet offers to cache previously loaded pages. Its cache system is very basic and is just a small array whose size is a power of two. Next, we simply reuse the OCaml hash function — in this respect, it may be worth testing another hash function.
Cachet & schedulers
Like most of our projects, Cachet is independent of schedulers. There is therefore a variant with Lwt and a variant with Miou. However, we need to clarify a behaviour related to the use of Cachet. Reading a file, whether with read(3)
or mmap(3P)
, does not block, but it can take some time.
As we have already experienced and explained here, it may be necessary to explain to the scheduler whether it is appropriate to do something else after such a syscall. In the case of Lwt, it might be a good idea to insert Lwt.pause
just after our syscall so that Lwt gives another service the opportunity to run despite the time taken trying to read from a file. However, particularly for Lwt, this means closing Cachet in the hell of the monad (in other words, there is no way to escape it) because of this possible Lwt.pause
(which returns unit Lwt.t
).
The composition of Cachet with Lwt is therefore quite different from what we’ve been able to experiment with. One of our other articles suggests not using functors (too much), and although we can in fact abstract Lwt.t
from unit Lwt.t
(and even reduce it such that type 'a t = 'a
) with the HKP trick, we opted for composition by hand.
The problem relates to Lwt (and Async) and doesn’t apply to Miou when it’s possible to raise effects. However, from such a composition, a choice has been made to give Lwt the opportunity to do something else after mmap
. We could, in other types of applications, make another choice on this precise question.
Carton
Carton is a library that was originally developed for ocaml-git. It was internal to the project but we considered that the PACK format’s field of application could be wider than that of Git. We decided to extract the project from ocaml-git
and make it a library in its own right. Carton’s objective remains fairly rudimentary. It consists of:
- extract objects from a PACK file (whether or not these objects are Git objects)
- generate an
*.idx
file from a PACK file in order to have quick access to the objects - verifying a PACK file such as
git verify-pack
does - and finally generate a PACK file from a list of objects
Carton is a library and a tool that you can now use on your Git repositories. Here are a few examples of how to use carton
. We’ll start by cloning a repository to test Carton and go to the folder containing the PACK file.
$ opam install carton.1.0.0
$ git clone https://github.com/ocaml/ocaml
$ cd ocaml/.git/objects/pack/
Carton can check a PACK file. Verifying means extracting all the objects in the file from memory and calculating their hash. This command is similar to git verify-pack
.
$ carton verify pack-*.pack
Carton can extract a specific object (commit, tree or blob) from a PACK file using its associated *.idx
file and the object identifier (the hash of the commit, for example).
$ carton get pack-*.idx 89055b054eeec0c6c6b6118d6490b6792da7fef2
Instead of extracting objects from a PACK file into memory, you can also extract them as files using explode
.
$ mkdir loose
$ carton explode 'loose/%s/%s' pack-*.pack > entries.pack
Finally, Carton can create a new PACK file from a list of objects stored in files with make. It can also generate the *.idx
file associated with the new PACK file. As we’ve just re-packaged the objects in the repository, we should find the same objects.
$ carton make -n $(cat entries.pack | wc -l) -e entries.pack new.pack
$ carton index new.pack
$ carton get new.idx 89055b054eeec0c6c6b6118d6490b6792da7fef2
Please note that the above actions, applied to ocaml/ocaml
, may take some time due to the history of this project.
In the example above, we can see the extraction of a Git object, the extraction of all the objects in a PACK file and the creation of a new PACK file based on all the extracted objects.
As you can see, creating a PACK file can take a long time. However, the advantage of the PACK file lies particularly in obtaining the objects and in the rate of compression of the PACK file:
+--------+-------------+----------+-------+--------------+
| | pack-*.pack | new.pack | loose | loose.tar.gz |
+--------+-------------+----------+-------+--------------+
| size | 355M | 648M | 8.3G | 1.8G |
+--------+-------------+----------+-------+--------------+
The PACK file is primarily designed to provide access to objects according to their identifiers. This access must be as fast as possible, even if the object is first compressed with decompress and can be compressed in the form of a patch with duff. Here are a few metrics to give you an idea.
+--------------+-------------+----------+---------+
| | pack-*.pack | new.pack | loose |
+--------------+-------------+----------+---------+
| git cat-file | ~ 0.01s | N/A | N/A |
+--------------+-------------+----------+---------+
| carton get | ~ 0.20s | ~ 0.30s | |
+--------------+-------------+----------+---------+
| cat | N/A | N/A | 0.0006s |
+--------------+-------------+----------+---------+
What’s important to note is the ability to have random access to objects simply by having the associated *.idx
file, the production of which is quite efficient. This is not or hardly the case for compression formats such as GZip. And that’s the whole point of PACK files, with an indexing method for almost immediate access to objects according to their identifiers and offering a very good compression ratio.
NOTE: Carton does not compress the repository as well as Git. The main reason is that Git has some heuristics relating to Git objects that Carton does not implement - because Carton wishes to be independent of Git concepts. These heuristics apply in particular to the order in which we want to pack objects. In addition, Git prepares the ground so that the antecedents of a blob object (which is a file in your repository), for example, are the old versions of that same blob (and therefore the old versions of your file).
In this context, the patch algorithm implemented by duff applies very well and gives very good results.
For more details on these heuristics, you can read this discussion that serves as documentation.
Carton & parallelism
As always, our libraries are independent of schedulers. There is a version of Carton with Lwt and a version with Miou.
Some of the tasks Carton performs, such as indexing, are highly parallelizable. In this case, the new derivation of Carton with Miou exists to take advantage of the latter’s domain pool.
It was also quite easy to parallelize the work on carton index
and carton verify
. Here are some other metrics which, thanks to OCaml 5 and Miou, bring us closer to Git performance:
$ hyperfine \
-n git \
"git verify-pack pack-03a3a824757ff4c225874557c36d44eefe3d7918.idx" \
-n carton \
"carton verify pack-03a3a824757ff4c225874557c36d44eefe3d7918.pack -q --threads 4"
Benchmark 1: git
Time (mean ± σ): 329.2 ms ± 0.9 ms [User: 384.2 ms, System: 27.8 ms]
Range (min … max): 327.7 ms … 330.9 ms 10 runs
Benchmark 2: carton
Time (mean ± σ): 712.1 ms ± 10.9 ms [User: 1111.8 ms, System: 1112.6 ms]
Range (min … max): 695.4 ms … 726.8 ms 10 runs
Summary
git ran
2.16 ± 0.03 times faster than carton
NOTE: it may come as a surprise that Carton is 2 times slower than Git for analysing a PACK file, but it should be noted that almost the entire Carton implementation is in OCaml! At this stage, the idea is more to give you an idea, but we literally find ourselves comparing a Bugatti with a Citroën 2CV.
Carton & Emails
Finally, this in-depth rewrite of Carton allows us to take advantage of the PACK format for storing our emails.
In fact, we are experimenting with and developing an email solution within our cooperative, and email archiving is one of our objectives. Based on our experience of implementing Git, we thought that the PACK format could be a very interesting format for archiving emails.
It combines two features, rapid access to emails and compression by patches, which are very interesting when it comes to handling emails. Finally, it also corresponds more or less to the way we use email:
- we don’t want to delete them (more often than not, we want to keep them ad vitam aeternam)
- and we don’t modify them
It therefore corresponds to a sort of read-only database. For more details on this aspect of Carton and the results of our experiments, I suggest you read our recent article on our cooperative’s blog.