[ANN] Release of Carton 1.0.0 and Cachet

dinosaure · January 14, 2025, 10:43am

I’m delighted to announce the release of Carton 1.0.0 and Cachet (which will be released soon into opam-repository).

Carton is a reimplementation of the Git PACK format. A PACK file is what you can find in your .git/objects/pack in your favourite Git repository. It contains mainly all your Git objects. This format provides a good compression ratio and the ability to extract objects almost directly. It can be seen as a read-only key-value database — in effect, modifying Git objects is impossible.

This project is built around the OCaml implementation of Git that we have. But the PACK format is also interesting in its own right and outside the Git concepts.

The PACK format offers double compression. A zlib compression (proposed by decompress) as well as a compression between objects in the form of a binary patch (proposed by duff).

So, if the “words” appear quite frequently (like the words used in a programming language — if, else, then, etc.), the second level of compression becomes very interesting where an object (such as a file) is simply a succession of patches with other objects.

Cachet, a library for `mmap` syscall

Carton and the PACK format very often use syscall mmap. The point is to be able to take advantage of the kernel cache system to read a PACK file. The kernel can read a file in advance when reading a page via mmap. Basically, the kernel anticipates that you might want to get the next page after the one you requested.

However, in the case of Carton, it is sometimes necessary to ‘go back’, particularly for patched objects whose source is often upstream.

Cachet is an intermediate layer for mmap that caches previously obtained pages. In this way, we take advantage of both the kernel for subsequent pages and our library for previous pages.

Let’s take a concrete example. Carton can analyse a PACK file as git verify-pack does. Let’s make a comparison with and without Cachet.

+--------------+-------------+----------------+-----------------+
|              | with cachet | without cachet | git verify-pack |
+--------------+-------------+----------------+-----------------+
|         time |       17.8s |          41.8s |            9.3s |
+--------------+-------------+----------------+-----------------+
| cache misses |        936M |          1933M |            246M |
+--------------+-------------+----------------+-----------------+

As you can see, using Cachet improves Carton’s execution time. We’re still not as competitive as git-verify-pack, but we’re getting close!

Cachet offers to cache previously loaded pages. Its cache system is very basic and is just a small array whose size is a power of two. Next, we simply reuse the OCaml hash function — in this respect, it may be worth testing another hash function.

Cachet & schedulers

Like most of our projects, Cachet is independent of schedulers. There is therefore a variant with Lwt and a variant with Miou. However, we need to clarify a behaviour related to the use of Cachet. Reading a file, whether with read(3) or mmap(3P), does not block, but it can take some time.

As we have already experienced and explained here, it may be necessary to explain to the scheduler whether it is appropriate to do something else after such a syscall. In the case of Lwt, it might be a good idea to insert Lwt.pause just after our syscall so that Lwt gives another service the opportunity to run despite the time taken trying to read from a file. However, particularly for Lwt, this means closing Cachet in the hell of the monad (in other words, there is no way to escape it) because of this possible Lwt.pause (which returns unit Lwt.t).

The composition of Cachet with Lwt is therefore quite different from what we’ve been able to experiment with. One of our other articles suggests not using functors (too much), and although we can in fact abstract Lwt.t from unit Lwt.t (and even reduce it such that type 'a t = 'a) with the HKP trick, we opted for composition by hand.

The problem relates to Lwt (and Async) and doesn’t apply to Miou when it’s possible to raise effects. However, from such a composition, a choice has been made to give Lwt the opportunity to do something else after mmap. We could, in other types of applications, make another choice on this precise question.

Carton

Carton is a library that was originally developed for ocaml-git. It was internal to the project but we considered that the PACK format’s field of application could be wider than that of Git. We decided to extract the project from ocaml-git and make it a library in its own right. Carton’s objective remains fairly rudimentary. It consists of:

extract objects from a PACK file (whether or not these objects are Git objects)
generate an *.idx file from a PACK file in order to have quick access to the objects
verifying a PACK file such as git verify-pack does
and finally generate a PACK file from a list of objects

Carton is a library and a tool that you can now use on your Git repositories. Here are a few examples of how to use carton. We’ll start by cloning a repository to test Carton and go to the folder containing the PACK file.

$ opam install carton.1.0.0
$ git clone https://github.com/ocaml/ocaml
$ cd ocaml/.git/objects/pack/

Carton can check a PACK file. Verifying means extracting all the objects in the file from memory and calculating their hash. This command is similar to git verify-pack.

$ carton verify pack-*.pack

Carton can extract a specific object (commit, tree or blob) from a PACK file using its associated *.idx file and the object identifier (the hash of the commit, for example).

$ carton get pack-*.idx 89055b054eeec0c6c6b6118d6490b6792da7fef2

Instead of extracting objects from a PACK file into memory, you can also extract them as files using explode.

$ mkdir loose
$ carton explode 'loose/%s/%s' pack-*.pack > entries.pack

Finally, Carton can create a new PACK file from a list of objects stored in files with make. It can also generate the *.idx file associated with the new PACK file. As we’ve just re-packaged the objects in the repository, we should find the same objects.

$ carton make -n $(cat entries.pack | wc -l) -e entries.pack new.pack
$ carton index new.pack
$ carton get new.idx 89055b054eeec0c6c6b6118d6490b6792da7fef2

Please note that the above actions, applied to ocaml/ocaml, may take some time due to the history of this project.

In the example above, we can see the extraction of a Git object, the extraction of all the objects in a PACK file and the creation of a new PACK file based on all the extracted objects.

As you can see, creating a PACK file can take a long time. However, the advantage of the PACK file lies particularly in obtaining the objects and in the rate of compression of the PACK file:

+--------+-------------+----------+-------+--------------+
|        | pack-*.pack | new.pack | loose | loose.tar.gz |
+--------+-------------+----------+-------+--------------+
|   size |        355M |     648M |  8.3G |         1.8G |
+--------+-------------+----------+-------+--------------+

The PACK file is primarily designed to provide access to objects according to their identifiers. This access must be as fast as possible, even if the object is first compressed with decompress and can be compressed in the form of a patch with duff. Here are a few metrics to give you an idea.

+--------------+-------------+----------+---------+
|              | pack-*.pack | new.pack | loose   |
+--------------+-------------+----------+---------+
| git cat-file |     ~ 0.01s |      N/A |     N/A |
+--------------+-------------+----------+---------+
|   carton get |     ~ 0.20s |  ~ 0.30s |         |
+--------------+-------------+----------+---------+
|          cat |         N/A |      N/A | 0.0006s |
+--------------+-------------+----------+---------+

What’s important to note is the ability to have random access to objects simply by having the associated *.idx file, the production of which is quite efficient. This is not or hardly the case for compression formats such as GZip. And that’s the whole point of PACK files, with an indexing method for almost immediate access to objects according to their identifiers and offering a very good compression ratio.

NOTE: Carton does not compress the repository as well as Git. The main reason is that Git has some heuristics relating to Git objects that Carton does not implement - because Carton wishes to be independent of Git concepts. These heuristics apply in particular to the order in which we want to pack objects. In addition, Git prepares the ground so that the antecedents of a blob object (which is a file in your repository), for example, are the old versions of that same blob (and therefore the old versions of your file).

In this context, the patch algorithm implemented by duff applies very well and gives very good results.

For more details on these heuristics, you can read this discussion that serves as documentation.

Carton & parallelism

As always, our libraries are independent of schedulers. There is a version of Carton with Lwt and a version with Miou.

Some of the tasks Carton performs, such as indexing, are highly parallelizable. In this case, the new derivation of Carton with Miou exists to take advantage of the latter’s domain pool.

It was also quite easy to parallelize the work on carton index and carton verify. Here are some other metrics which, thanks to OCaml 5 and Miou, bring us closer to Git performance:

$ hyperfine \
  -n git \
    "git verify-pack pack-03a3a824757ff4c225874557c36d44eefe3d7918.idx" \
  -n carton \
    "carton verify pack-03a3a824757ff4c225874557c36d44eefe3d7918.pack -q --threads 4"
Benchmark 1: git
  Time (mean ± σ):     329.2 ms ±   0.9 ms    [User: 384.2 ms, System: 27.8 ms]
  Range (min … max):   327.7 ms … 330.9 ms    10 runs
 
Benchmark 2: carton
  Time (mean ± σ):     712.1 ms ±  10.9 ms    [User: 1111.8 ms, System: 1112.6 ms]
  Range (min … max):   695.4 ms … 726.8 ms    10 runs
 
Summary
  git ran
    2.16 ± 0.03 times faster than carton

NOTE: it may come as a surprise that Carton is 2 times slower than Git for analysing a PACK file, but it should be noted that almost the entire Carton implementation is in OCaml! At this stage, the idea is more to give you an idea, but we literally find ourselves comparing a Bugatti with a Citroën 2CV.

Carton & Emails

Finally, this in-depth rewrite of Carton allows us to take advantage of the PACK format for storing our emails.

In fact, we are experimenting with and developing an email solution within our cooperative, and email archiving is one of our objectives. Based on our experience of implementing Git, we thought that the PACK format could be a very interesting format for archiving emails.

It combines two features, rapid access to emails and compression by patches, which are very interesting when it comes to handling emails. Finally, it also corresponds more or less to the way we use email:

we don’t want to delete them (more often than not, we want to keep them ad vitam aeternam)
and we don’t modify them

It therefore corresponds to a sort of read-only database. For more details on this aspect of Carton and the results of our experiments, I suggest you read our recent article on our cooperative’s blog.

nojb · January 14, 2025, 11:27am

Thanks for the interesting post!

Incidentally, this sounds very similar to public-inbox public-inbox - an "archives first" approach to mailing lists, which uses the Git repository format to store emails, and is widely used in the Linux community https://lore.kernel.org/.

Cheers,
Nicolas

dinosaure · January 14, 2025, 11:41am

Very interesting! Indeed, I think the basic intuition we had is the same as public-inbox. However, the approach is subtly different. The idea of public-inbox is to store emails in a Git repository (which is much more than a PACK file) and take advantage of what Git can do to store the repository content. The latter therefore takes over the aspects of Git (notably the trees).

As far as we’re concerned, we’re only taking advantage of the PACK file. We don’t use Git’s concepts. We also try to “split” emails to take advantage of the similarities between email headers. This also applies to email content. In our article, we show that our PACK file generation aligns patches between email headers (i.e. we prefer to patch between email headers rather than between blob as Git will do) and patches between email content.

But we should indeed compare the result.

PS: I’ve also tried looking for a caml-list archive and it seems you’ve made one. This could be interesting as a basis for comparison.

nojb · January 14, 2025, 12:32pm

Yes, indeed I used to maintain a public-inbox mirror of the caml-list. Unfortunately, the machine on which it was hosted at some point went away and the mirror with it. However, you can still get an old copy (last updated ~2022) from caml-list - the Caml user's mailing list (it can be cloned with Git, see instructions at public-inbox mirroring instructions).

Cheers,
Nicolas

raphael-proust · January 14, 2025, 12:43pm

That could be a good use-case for @sudha 's Lwt_domain.

R_Huxton · January 14, 2025, 1:48pm

Can I just say “thank you” for the brief introduction to what Carton actually is?

There are so many release messages that just say “FooWurzle 3.2 is out!” with no way to know if that is a javascript library or a flying car.

dinosaure · January 17, 2025, 4:40pm

After some experimentation, we were able to draw a more formal conclusion between what we do and what public-inbox offers. You can see the results of our tests here, with a detailed description: Robur's blog - Git, Carton and emails

Thanks in any case for giving us not only an archive of the caml-list but also real pointers. This enabled us to test on a rather impressive corpus (we’re talking millions of emails!).

nojb · January 17, 2025, 4:56pm

Thanks for following up; interesting write up!

Cheers,
Nicolas

Topic		Replies	Views
[ANN] ocaml-git 2.0 Community mirageos , announce , git	12	2851	October 27, 2018
[ANN] Release of OCaml-Git v3.0, duff, encore, decompress, etc Community irmin , mirageos , announce	0	1461	January 11, 2021
Opam repository archival, phase 3 Ecosystem announce , opam-repository	20	805	July 22, 2025
Opam repository archival, phase 2 - OCaml 4.08 is the lower bound Ecosystem announce , opam-repository	21	812	February 26, 2025
Opam repository archival, Phase 1: unavailable packages Ecosystem announce , opam-repository	6	515	January 1, 2025

[ANN] Release of Carton 1.0.0 and Cachet

Cachet, a library for mmap syscall

Cachet & schedulers

Carton

Carton & parallelism

Carton & Emails

Related topics

Cachet, a library for `mmap` syscall