I’m happy to announce the first release of compsort, a tool that reorders files in an archive for better compression. It works by grouping files according to a distance computed between every file pair. You can install it with `opam install compsort` (requires OCaml 5.2 for parallelism).
Website with more details and examples in the README.md, plus source: adrien / compsort · GitLab
The goal is not new but, AFAIK, the approach is. I am very interested in prior or related art!
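To give a rough idea of the overall flow, here is a heavily simplified OCaml sketch (not the actual compsort code): read every file, compute a distance for every pair, then reorder. The byte-wise distance and the greedy nearest-neighbour ordering below are only stand-ins for the real, more elaborate ones discussed later in this post.

```ocaml
(* Simplified sketch of the overall pipeline, not the actual compsort code. *)

let read_file path =
  In_channel.with_open_bin path In_channel.input_all

(* Placeholder distance in [0., 1.]: fraction of differing bytes over the
   common prefix length.  The real measure is compression-aware. *)
let distance a b =
  let n = min (String.length a) (String.length b) in
  if n = 0 then 1.
  else begin
    let diff = ref 0 in
    for i = 0 to n - 1 do
      if a.[i] <> b.[i] then incr diff
    done;
    float_of_int !diff /. float_of_int n
  end

let reorder = function
  | [] -> []
  | paths ->
    let files = Array.of_list paths in
    let contents = Array.map read_file files in
    let n = Array.length files in
    (* Full pairwise distance matrix: O(n^2) distance computations. *)
    let d =
      Array.init n (fun i ->
        Array.init n (fun j -> distance contents.(i) contents.(j)))
    in
    (* Greedy stand-in for the real grouping: always emit the closest
       not-yet-emitted file next. *)
    let used = Array.make n false in
    let order = ref [ 0 ] in
    used.(0) <- true;
    for _ = 2 to n do
      let last = List.hd !order in
      let best = ref (-1) in
      for j = 0 to n - 1 do
        if (not used.(j)) && (!best = -1 || d.(last).(j) < d.(last).(!best))
        then best := j
      done;
      used.(!best) <- true;
      order := !best :: !order
    done;
    List.rev_map (fun i -> files.(i)) !order
```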
Results
Compsort achieves improvements that would typically require larger compression windows and therefore more memory. The improvements are only a few percent, but in the domain of compression a few percent is actually a lot.
With xz compression, an Ubuntu initrd on my machine is reduced by more than 11.5%, while the best achievable improvement is 12.7% (the reordering gives 90% of the best result). Similarly, the tree of linux-firmware.git can be compressed 6% better, while the best achievable improvement is 9.4% (the reordering gives 63% of the best result).
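(For reference, the “gives X% of the best result” figures are simply the ratio of the achieved improvement to the best achievable one, e.g. 11.5 / 12.7 ≈ 0.90, i.e. 90%.)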
Visualizations
In order to better explain what it does, I quite like the visualizations I have so far (there will be better ones in the future), where the value of the pixel at (x, y) indicates how similar files x and y are.
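As a rough idea of how such a picture can be produced (this is only an illustration, not necessarily how compsort renders its plots), a distance matrix can be dumped as a grayscale PGM image where darker pixels mean more similar files:

```ocaml
(* Illustration only: write an n x n distance matrix as an ASCII PGM image,
   darker pixels meaning more similar files. *)
let write_pgm filename (d : float array array) =
  let n = Array.length d in
  Out_channel.with_open_text filename (fun oc ->
    Printf.fprintf oc "P2\n%d %d\n255\n" n n;
    Array.iter
      (fun row ->
        Array.iter
          (fun dist ->
            (* Map distance 0. (identical) .. 1. (unrelated) to 0 .. 255. *)
            let v = int_of_float (255. *. Float.min 1. (Float.max 0. dist)) in
            Printf.fprintf oc "%d\n" v)
          row)
      d)
```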
Before:
After reordering:
Files that are very different from the others are all packed at the end, and there is also an isolated cluster of files that are similar to each other but different from everything else.
One can also see that the distinct row and column pattern from the first picture has disappeared: it indicated that every 15 or so files in that region were similar but separated by dissimilar files; they are now grouped together.
While there are certainly possible improvements, the results are good. It’s a case where one might wonder why the results are so good considering all the approximations that took place. [1]
[1] Most of the algorithms/libraries I’ve tried to use rely on having an actual, proper distance function, which I don’t have.
Future work
I’ll try to improve the distance function. Currently it performs some of the steps of compression algorithms to detect redundancies, but maybe reusing a compression library would give better results if it can be made fast enough (lz4 is borderline, but it unfortunately has tiny dictionaries).
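One classic compression-library-based measure is the normalized compression distance. Here is a sketch written against an abstract `compressed_size` function, so that any library could be plugged in and no particular API is assumed:

```ocaml
(* Sketch of a normalized-compression-distance style measure.  The
   [compressed_size] function is abstract: any compression library
   (zlib, zstd, lz4 bindings, ...) could provide it. *)
let ncd ~(compressed_size : string -> int) a b =
  let ca = compressed_size a in
  let cb = compressed_size b in
  let cab = compressed_size (a ^ b) in
  if max ca cb = 0 then 0.
  else
    (* NCD(a, b) = (C(ab) - min(C(a), C(b))) / max(C(a), C(b)) *)
    float_of_int (cab - min ca cb) /. float_of_int (max ca cb)
```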
Clustering could also be better, as the current algorithm is very basic (it collects files that are 90% similar, then 80% similar, then 70%, …). I tried several algorithms, but I don’t have a good-enough distance function (for instance, the triangle inequality probably doesn’t hold) and the results were worse.
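In OCaml-flavoured pseudocode, that basic threshold pass looks roughly like this (a simplified sketch, not the exact implementation):

```ocaml
(* Rough sketch of the threshold-based clustering described above:
   walk down similarity thresholds and, at each level, grow clusters
   out of the files that are still unassigned. *)
let cluster (d : float array array) =
  let n = Array.length d in
  let assigned = Array.make n false in
  let clusters = ref [] in
  List.iter
    (fun threshold ->
      let max_dist = 1. -. threshold in
      for seed = 0 to n - 1 do
        if not assigned.(seed) then begin
          (* Collect every unassigned file within [max_dist] of the seed. *)
          let members = ref [] in
          for j = n - 1 downto 0 do
            if (not assigned.(j)) && d.(seed).(j) <= max_dist then begin
              assigned.(j) <- true;
              members := j :: !members
            end
          done;
          if !members <> [] then clusters := !members :: !clusters
        end
      done)
    [ 0.9; 0.8; 0.7; 0.6; 0.5; 0.4; 0.3; 0.2; 0.1; 0.0 ];
  List.rev !clusters
```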
All this will benefit from better visualizations, and I’d like to have interactive plots that can be hovered with the mouse to get the distance value and the full file names.
Oh, and the code isn’t always pretty, as it went through a lot of experimental stages and low-level tweaks to improve performance.