[ANN] Compsort - reorder files in archives for improved compression

I’m happy to announce the first release of compsort, a tool that reorders files in an archive for better compression. It works by grouping files according to a distance computed between every pair of files. You can install it with opam install compsort (requires OCaml 5.2 for parallelism).
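To give a concrete feel for what “reordering according to pairwise distances” can mean, here is a minimal, heavily simplified sketch; it is not compsort’s actual algorithm, and dist is just a placeholder for a real distance function:

```ocaml
(* Minimal sketch (NOT compsort's actual algorithm): start from file 0 and
   always place next the closest file that hasn't been placed yet.
   [dist i j] is a placeholder for the real distance between files i and j;
   [n] is the number of files.  Returns an order as a list of file indices. *)
let greedy_order (dist : int -> int -> float) (n : int) : int list =
  if n = 0 then []
  else begin
    let placed = Array.make n false in
    placed.(0) <- true;
    let order = ref [ 0 ] in
    let current = ref 0 in
    for _step = 2 to n do
      (* pick the closest not-yet-placed file *)
      let best = ref (-1) in
      for j = 0 to n - 1 do
        if (not placed.(j))
           && (!best = -1 || dist !current j < dist !current !best)
        then best := j
      done;
      placed.(!best) <- true;
      order := !best :: !order;
      current := !best
    done;
    List.rev !order
  end
```

Compsort itself does something smarter (it clusters files, see the future work below), but the sketch shows the intent: files that are close according to the distance end up adjacent in the archive.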

Website with more details and examples in README.md, plus source: adrien / compsort · GitLab

The goal is not new but, AFAIK, the approach is. I am very interested in prior or related art!

Results

Compsort achieves improvements that would otherwise typically require larger compression windows and therefore more memory. The improvements are only a few percent, but in the domain of compression, a few percent is actually a lot.

With xz compression, an Ubuntu initrd on my machine shrinks by more than 11.5%, while the best achievable improvement is 12.7% (the reordering gives 90% of the best result). Similarly, the tree of linux-firmware.git can be compressed 6% better, while the best achievable improvement is 9.4% (the reordering gives 63% of the best result).

Visualizations

In order to better explain what it does, I quite like the visualizations I have so far (there will be better ones in the future), where the value of the pixel at (x,y) indicates how similar files x and y are.

Before:

After reordering:

Files that are very different from all others are packed together at the end, and there’s also an isolated cluster of files that are similar to each other but different from everything else.
One can also see that the distinct row and column pattern from the first picture has disappeared: it indicated that every 15 or so files in that region were similar but separated by dissimilar files; they are now grouped together.
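Such a picture is straightforward to produce from a distance matrix. A minimal sketch, writing a grayscale PGM with distances assumed normalised to [0, 1] (this is only an illustration, not how compsort renders its own plots):

```ocaml
(* Dump a pairwise distance matrix as a grayscale PGM image where the
   pixel at (x, y) encodes dist(x, y): dark = distant, bright = similar.
   Distances are assumed to be normalised to [0., 1.]. *)
let write_pgm filename (dist : float array array) =
  let n = Array.length dist in
  let oc = open_out_bin filename in
  Printf.fprintf oc "P5\n%d %d\n255\n" n n;
  for y = 0 to n - 1 do
    for x = 0 to n - 1 do
      let v = int_of_float (255. *. (1. -. dist.(y).(x))) in
      output_char oc (Char.chr (max 0 (min 255 v)))
    done
  done;
  close_out oc
```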

While there are certainly improvements possible, the results are good. It’s a case where one might wonder why the results are so good considering all the approximations that took place [1].

[1] Most of the algorithms/libraries I’ve tried rely on having an actual, proper distance function, which I don’t have.

Future work

I’ll try to improve the distance function. Currently it runs some steps of compression algorithms to detect redundancies, but reusing a compression library might give better results if it can be made fast enough (lz4 is borderline, but unfortunately it has tiny dictionaries).

Clustering could be better as the current algorithm is very basic (it collects files that are 90% similar, then 80% similar, then 70%, …). I tried several algorithms, but I don’t have a good-enough distance function (for instance, the triangle inequality probably doesn’t hold) and results were worse.
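To make the 90%/80%/70% idea a bit more concrete, here is a minimal sketch of one way such passes can be written. The dist function, its normalisation to [0, 1] (0 = identical), and the exact grouping rule are assumptions for the sake of the example, not compsort’s real implementation:

```ocaml
(* Group files that are within [t] of some member of a group (single-linkage
   style): a file either joins/merges the groups it is close to, or starts
   its own group. *)
let clusters_at_threshold dist t files =
  List.fold_left
    (fun groups f ->
      match List.partition (List.exists (fun g -> dist f g <= t)) groups with
      | [], others -> [ f ] :: others
      | close, others -> (f :: List.concat close) :: others)
    [] files

(* Sweep increasingly loose thresholds: 90% similar (distance <= 0.1),
   then 80%, etc.  Files grouped at a tight threshold form a cluster;
   files still alone get another chance at the next threshold. *)
let threshold_clusters dist files =
  let thresholds = [ 0.1; 0.2; 0.3; 0.4; 0.5 ] in
  let clusters, leftovers =
    List.fold_left
      (fun (clusters, pending) t ->
        let grouped, alone =
          List.partition
            (fun g -> List.length g > 1)
            (clusters_at_threshold dist t pending)
        in
        (grouped @ clusters, List.concat alone))
      ([], files) thresholds
  in
  List.map (fun f -> [ f ]) leftovers @ clusters
```

A real implementation also has to decide how to order the clusters relative to each other, which this sketch ignores.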

All this will benefit from better visualizations, and I’d like to have interactive plots that can be hovered with the mouse to get the distance value and the full file names.

Oh, and the code isn’t always pretty, as it went through a lot of experimental stages and low-level tweaks to improve performance.


Very cool! Out of curiosity, who is the intended userbase for this tool? Is it mostly for your own usage, or do you have concrete plans to push it to some communities, for example distributions/packagers?

Thanks. The intended userbase is distros, and possibly embedded setups too. The initial motivation was to compress firmware files better in Linux initrds. There is a lot of duplication in these, due to often having one file per hardware generation or model, but it’s difficult to tell how much there is and how it is spread.

I first tried to sort files by name, but there is no general rule: AMD uses ${hardware_gen}_${component}.bin and its files should be sorted by component, but for others the best sort would be a purely alphabetical one. I think filenames can encode 4 or 5 independent variables, which makes guesswork or intuition impossible.

Having a tool to compute that makes it possible to at least inspect the results. Using the results for initrds is not obvious however, due to the size of the data (hundreds of MBs), but I think there are ways if we accept not having the optimal solution (maybe in the future we can devise near-exhaustive rules).

A better visualization is needed to guide future developments, because typical datasets have thousands of files, and that means millions of data points.

By the way, it’s possible the best conclusion is to increase your compressor’s dictionary size. There are tradeoffs in CPU and memory usage which people are usually unaware of, and I would be happy if this helped raise awareness of the corresponding settings.

One final motivation was to actually finish the project and publish it: I already had interesting results last year, but the tool was complex for others to use.

As for why publish now: I’m really looking for related art. :slight_smile:


Out of curiosity, which clustering algorithms did you try? Your description sounds like a variant of agglomerative hierarchical clustering, and from afar I would expect other hierarchical clustering variants to also work.

I tried agglomerative clustering with the prbnmcn-clustering package and it gave OK results, but there were cases where I expected good grouping and the visualizations didn’t show it. In particular, one of my test datasets (dubbed rpi4-initrd) showed a small but noticeable size increase.

The algorithm I devised is quick and greedy; it turned out better, and the rpi4-initrd increase became very small. It’s indeed based on the same rough ideas as agglomerative hierarchical clustering; I can’t yet explain for sure why it works better. I had more ideas, but this was already good enough.

One reason for all of this is that my distance function is maybe sometimes weird: I was expecting visualizations to look like this distance matrix (taken at random on the Internet), but there is a lot of “noise” all over my pictures.

This reminds me that the distance function may not be symmetric. This sounds like a bug, but I’m not 100% sure it is. In any case, it is likely to cause issues with most algorithms. The current distance computation is unfortunately very heavy already.

Once I have good visualizations, I’ll experiment with a function dist(x, y) = compress(x+y) / (compress(x) + compress(y) - compress(x+y)) (a Jaccard-like index) where compress is either xz or zstd. I would have used lz4, because it compresses repetitions (what I want to count) but doesn’t encode them further (which would be noise for me), but its compression window is too small to spot repetitions a few megabytes apart. Unfortunately, there are no OCaml bindings for these compression libraries with enough API coverage at the moment.
Hopefully, I’ll then have a distance function that I am sure is reliable and meaningful, and I can try various clustering algorithms again.
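A sketch of that formula, with compressed_size standing in for “size in bytes of the xz- or zstd-compressed data” (with the bindings missing, it could shell out to the command-line tools for experiments):

```ocaml
(* Sketch of the planned distance.  [compressed_size] is a placeholder for
   "size in bytes of the xz- or zstd-compressed input"; a real version
   would also guard against [shared] being zero or negative. *)
let dist (compressed_size : string -> int) (x : string) (y : string) : float =
  let cx = float_of_int (compressed_size x)
  and cy = float_of_int (compressed_size y)
  and cxy = float_of_int (compressed_size (x ^ y)) in
  (* redundancy shared between x and y: how much smaller x+y compresses
     than x and y taken separately *)
  let shared = cx +. cy -. cxy in
  cxy /. shared
```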

Another way to compute better distances may be to replicate, in my own code, what I did when hooking into xz’s match finder and compressing everything at once. That means turning the current match finder into a production-grade implementation, which requires some work but may be easy enough, and fast, since its time complexity would be linear instead of the current quadratic one (OTOH, doing it file by file makes the process incremental, which is useful for actual use).

tl;dr: hierarchical clustering with prbnmcn-clustering gave some sub-optimal results, maybe because the distance function has issues (mostly asymmetry, probably); I’m thinking about improving the distance function, but that requires more work; this can go in many directions.

This is interesting! Have you considered taking into account the archive format headers when comparing files? E.g. the cpio or tar header. Would that affect the result (much)?

The current implementation is generic with regard to the file contents and structure. One hope here is that, if there are specific file properties, they are found automatically; maybe we can then look at the classifications that have been made and come up with criteria that achieve similar results but are computationally cheaper.

How files are stored can also have an impact on compression, but I had to limit myself to something somewhat reasonable. I think it would affect the results only slightly.

By the way, I’ve started using plotly for visualization. I had to hack the bindings to do a few things I needed (like setting a colormap), but it’s already showing interesting things (the distance is sometimes wrong, but the overall behavior looks solid). The plots are pretty cool, but it’s not obvious how to share them as they’re interactive web pages built from data served by a local web server combined with scripts from cdnjs.cloudflare.com.

Sorry, maybe I misunderstand or I wasn’t clear. If we are looking at a directory (or archive) of files to reorder and put into an archive, the resulting archive will, in the case of tar and cpio, have each file prepended with a small-ish header. My thinking was that “file A || file B” compresses differently than “header for file A || file A (|| maybe padding) || header for file B || file B (|| maybe padding)”. But maybe you are saying this difference is insignificant?

I know very little about compression, but I think this idea is really fun! :smiley:

I wasn’t completely sure what you meant indeed.

I think the idea is interesting, but it’s also a lot of work and, I guess, a different project, even though it could maybe reuse some of what I presented. It would probably mostly matter for smaller files, but smaller files can add up and skew results, especially considering that an entry in tar is actually at least 500 bytes (a lot of fields, and fixed sizes such as 255 bytes for the path).

This reminds me there can also be considerations related to window size: it may be better to spread files a bit, provided they still fit in the compressor’s window, in order to avoid pushing some files outside of the relevant window. For instance, with files A, B, C, X, Y, Z, if A, B and C compress very well together, X, Y and Z also compress very well together, and the pairs A and X, B and Y, C and Z compress moderately well, the current approach would probably produce ABCZYX, and A and X may end up too far apart to get any compression benefit. Doing AXBYCZ might be better.
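A tiny illustration of that alternative layout (just the interleaving itself, with no window-awareness):

```ocaml
(* Interleave two clusters so that related files from each end up close
   together, e.g. A X B Y C Z instead of A B C Z Y X. *)
let rec interleave xs ys =
  match xs, ys with
  | [], rest | rest, [] -> rest
  | x :: xs', y :: ys' -> x :: y :: interleave xs' ys'

let () =
  assert (interleave [ "A"; "B"; "C" ] [ "X"; "Y"; "Z" ]
          = [ "A"; "X"; "B"; "Y"; "C"; "Z" ])
```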

Unfortunately, I haven’t been able to take these two kinds of considerations into account due to the additional complexity involved. I can imagine looking at them later on, but not for now.

PS: I’m pretty sure there are also other archive formats that may be interesting to look at but I don’t know which ones yet! The work here can apply to archives of any kind of data (static or shared libraries? images? containers? file system images?) but again, that’s too much for now. ;p
