Is it worth compressing marshaled output?

Hi all.

I’m working on a proof assistant that reads and rewrites cache files produced with the Marshal module. The file sizes range from 20K to 900K.

I wonder whether it could benefit from compressing these cache files, like Agda does with .agdai files. In a test with the LZ4 command-line tool (default compression settings), my cache files shrank by about 70% or more.

However, I do not know whether there’s an effective way to integrate compression with OCaml’s I/O channels, which seem hard to extend without hacking the C runtime. There are OCaml bindings to LZ4 and various implementations of other LZ77-family compression algorithms, but most of them only support compressing values of the bytes data type.

Keep things simple. Your data seems small enough to handle in memory. Marshal to bytes, compress the result and write it to disk. On the way back, read the file contents, decompress them and unmarshal from memory.
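A minimal sketch of that round trip, using only the standard library. The names save and load are hypothetical, and compress / decompress are placeholder parameters standing in for whatever string-to-string codec you end up using (for example, a function from an LZ4 binding); nothing here is tied to a specific compression library.

```ocaml
(* Sketch only: [save], [load], [compress] and [decompress] are
   illustrative names, not part of any existing library. *)

(* Marshal [v] to an in-memory string, compress it, write it to [path]. *)
let save ~(compress : string -> string) (path : string) v : unit =
  let raw = Marshal.to_string v [] in
  let packed = compress raw in
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> output_string oc packed)

(* Read the whole file, decompress it, then unmarshal from memory.
   As with any use of Marshal, the result type is not checked:
   the caller must annotate it correctly. *)
let load ~(decompress : string -> string) (path : string) =
  let ic = open_in_bin path in
  let packed =
    Fun.protect
      ~finally:(fun () -> close_in ic)
      (fun () -> really_input_string ic (in_channel_length ic))
  in
  Marshal.from_string (decompress packed) 0
```

With a real codec you would call something like save ~compress:my_compress "cache.bin" value and (load ~decompress:my_decompress "cache.bin" : my_type), where my_compress and my_decompress are whatever functions your chosen binding provides. Passing Fun.id for both gives the uncompressed behaviour, which is handy for comparing file sizes.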

Thank you. I just coded up a prototype that uses caml_output_value_to_malloc to marshal objects to a C buffer first, calls the LZ4 compression function, then uses caml/io.h to write to the I/O channel directly (this requires #define CAML_INTERNALS, though). It is all done in C (for various reasons this project already includes a lot of C code, so I didn’t mind adding a bit more). It seems to work well enough.

The package lz4_chans in opam might be of interest.

Since you can marshal anything to a string (the bytes data type is just a mutable string), you can then compress the output of Marshal.
In my experience, Marshal output usually compresses very well.

This sounds like a good solution. If you’d rather not write any C code, you can also use any external compression program (below I’m using xz for some serious compression):

# #load "unix.cma";;
# let oc = Unix.open_process_out "xz -z > /tmp/data.xz";;
val oc : out_channel = <abstr>
# output_value oc [1;2;3;4;5];;
- : unit = ()
# Unix.close_process_out oc;;
- : Unix.process_status = Unix.WEXITED 0
# let ic = Unix.open_process_in "xz -d < /tmp/data.xz";;
val ic : in_channel = <abstr>
# let v : int list = input_value ic;;
val v : int list = [1; 2; 3; 4; 5]
# Unix.close_process_in ic;;
- : Unix.process_status = Unix.WEXITED 0

Bindings to the Gzip compression library can also be found in the Camlzip package.
