Libraries for parsing binary formats?

What is a good way to parse binary file formats with OCaml? I’ve seen both bitstring and cstruct but assume there are more. Typical tasks involve decoding bit fields, reading a length field in order to know how much to read next, reading 8, 16, 32, and 64-bit values. In addition to libraries I would be interested in projects that use them.


I use cstruct, with some patches (for now) for supporting bitfields: https://github.com/mirage/ocaml-cstruct/pull/215

I should note it is still WIP and very brittle.


You can also use something like https://github.com/OCamlPro/ocplib-endian if you want to keep it simple.
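
To give an idea of what "keeping it simple" can look like, here is a minimal sketch that reads fixed-width big-endian values at explicit offsets. It assumes OCaml 4.08 or later and uses only the standard library's `Bytes` accessors rather than ocplib-endian itself; the `header` record and its layout are made up for illustration.

```ocaml
(* Sketch only: standard-library Bytes accessors (OCaml >= 4.08),
   not the ocplib-endian API. The header layout is hypothetical. *)
type header = { tag : int; flags : int; length : int32; id : int64 }

let read_header (b : Bytes.t) (off : int) : header =
  { tag = Bytes.get_uint8 b off;              (* 1 byte *)
    flags = Bytes.get_uint16_be b (off + 1);  (* 2 bytes, big-endian *)
    length = Bytes.get_int32_be b (off + 3);  (* 4 bytes, big-endian *)
    id = Bytes.get_int64_be b (off + 7) }     (* 8 bytes, big-endian *)

(* 15 bytes: tag, flags, length, id *)
let example =
  Bytes.of_string
    "\x01\x00\x02\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x2a"

let h = read_header example 0
```

Each accessor raises `Invalid_argument` if the read would run past the end of the buffer, so truncated input fails loudly rather than silently.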


[trigger warning: camlp5 ahead] This is going to be about de/marshaling binary wire formats and not about binary data in-memory. I no longer have access to the code, so this comment is at best a “trip report”. Perhaps not so useful.

I’ve had to do this a few times: decoding ethernet/IP/TCP/CORBA packets and payloads (needed programmatic dissectors that could do statistical analyses of response-times, and hadn’t found the tools which capture/dissect TCP and dump into files for further analysis – to diagnose bugs in multi-tier server-side applications). Another time, I wanted to decode/re-encode Java classfiles (to perform aspect-oriented injection of code into existing Java classfiles). There have been others. I find that the stream-based parsers-printers of camlp5 are a -great- tool.

Almost all binary wire-formats that I’ve dealt with turn out to be recursive-descent, and so writing de/marshallers for them with stream (LL(1)) parsers/printers is a doddle.

(1) of course, stuff like “read a length-count, demarshal N of some type” is trivial

(2) eventually you might want to (re-)marshal. One of the nice things about camlp5 is that pretty-printers look superficially like parsers, which makes it much easier to keep the marshaller in sync with the parser.

(3) Eventually one thing jumps out that doesn’t fit into stream-parsers: “skip forward to an N-byte alignment boundary”. So I re-implemented -only- those parts of the Stream module that were used by generated parsers, with a Stream.t type implemented using a buffer, viz.

module Stream = struct

  type 'a t = {mutable buf : 'a array;
               mutable buf_pos : int;
               mutable buf_bound : int;
               mutable abs_ofs : int;
               eof : 'a;
               filbuf : 'a t -> int}

  (* ... *)
end

and then one can write functions that return the absolute position in the “stream”, etc.

(4) of course, switching to this “flat” representation of the parsing-buffer also improves efficiency dramatically.
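
Since the original code is lost, here is only a rough sketch of the idea in (1) and (3): a flat buffer with a mutable position, from which the absolute offset, an alignment skip, and a length-counted read fall out easily. All names (`cursor`, `align`, `counted`, ...) are hypothetical, and it works on a fully loaded `Bytes.t` rather than a refillable buffer.

```ocaml
(* Hypothetical names throughout; this is not the original code. *)
type cursor = { data : Bytes.t; mutable pos : int }

let make data = { data; pos = 0 }

(* Absolute offset from the start of the input. *)
let position c = c.pos

let u8 c =
  let v = Bytes.get_uint8 c.data c.pos in
  c.pos <- c.pos + 1;
  v

let u32_be c =
  let v = Bytes.get_int32_be c.data c.pos in
  c.pos <- c.pos + 4;
  v

(* Skip forward to the next multiple of [n]; no-op if already aligned. *)
let align c n =
  let r = c.pos mod n in
  if r <> 0 then c.pos <- c.pos + (n - r)

(* Read a one-byte count, then demarshal that many items with [item]. *)
let counted c item =
  let n = u8 c in
  let rec go i acc =
    if i = n then List.rev acc else go (i + 1) (item c :: acc)
  in
  go 0 []

let c = make (Bytes.of_string "\x07\x00\x00\x00\x00\x00\x00\x2a\x02\x0a\x0b")
let tag = u8 c             (* one byte, then three bytes of padding follow *)
let () = align c 4         (* skip to the 4-byte boundary *)
let word = u32_be c
let items = counted c u8   (* count byte, then that many single bytes *)
```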

This replacement of Stream.t with something custom pretty much corresponds to “write a hacked lexer with interesting behaviour” that we do all the time when it comes to writing parsers, of course.

Unfortunately, I wrote all this code for IBM back in 2001, and they had pretty unenlightened policies about open-source. All lost in the sands of time.

Is it possible to liberate the interesting parser tools from camlp5 so that they’re more standalone?

Not really. Camlp5 is a system for building PreProcessors and PrettyPrinters (used to be called camlp4 for obvs. reasons), and before it was written, DDR wrote stream-parsers/printers as a preprocessor for caml-light. He wrote camlp4 (used to be called chamau) as a generalization of the idea, and the stream-grammar-extension was (IIRC) the first implemented use of chamau. That was all 25 years ago. At this point, if you want stream-{parsers,printers} you need to use camlp{4,5} to get 'em.

But I routinely use camlp{4,5} on parts of my code for a project, and PPX extensions in other parts. They’re incompatible at the source-code level, but there’s no reason you can’t mix 'em as binaries.

There is also Angstrom, but I don’t know if it is possible to make it work on non-byte granularity, as in a sequence of bits.
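
For bit fields that stay within a single byte, one workaround with any byte-granular parser is to read the whole byte and split it with shifts and masks. A minimal sketch, independent of Angstrom's API (the `bits` helper is a made-up name):

```ocaml
(* Extract [width] bits from [byte]; [shift] is the position of the
   field's lowest bit, counted from the least significant bit. *)
let bits ~shift ~width byte = (byte lsr shift) land ((1 lsl width) - 1)

(* Example: the first octet of an IPv4 header carries the version in
   its high 4 bits and the header length (IHL) in its low 4 bits. *)
let first_octet = 0x45
let version = bits ~shift:4 ~width:4 first_octet
let ihl = bits ~shift:0 ~width:4 first_octet
```

Fields that straddle byte boundaries need more care, which is where a bit-level library earns its keep.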

I remember reading about using bitstring not (too) long ago. I have just found the link: https://andreas.github.io/2014/08/22/implementing-the-binary-memcached-protocol-with-ocaml-and-bitstring

There is also parsifal. It seems maintained but I have no experience with it (and I don’t know if there is a usable library part or it only provides the binaries). I only remember vaguely the papers: http://spw14.langsec.org/papers/pasifal-report.pdf and https://www.ieee-security.org/TC/SPW2014/papers/5103a191.PDF


Thanks for all the recommendations. So far, I like bitstring and I wonder why it is not more widely used. I believe a bitstring-based parser works best in a monadic style that chains individual parsers together. These combinators are not part of bitstring, but they are easy to implement.
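
A rough sketch of what such combinators could look like. To keep it dependency-free, plain strings and byte offsets stand in for bitstrings here, so this shows the chaining style only, not bitstring's API; all names are hypothetical, and the `let*` syntax assumes OCaml 4.08 or later.

```ocaml
(* A parser reads from an offset and returns a value plus the new
   offset, or None on failure. Strings stand in for bitstrings. *)
type 'a parser = string -> int -> ('a * int) option

let return (x : 'a) : 'a parser = fun _ off -> Some (x, off)

(* Monadic bind: run [p], then feed its result to [f]. *)
let ( let* ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser =
  fun s off ->
    match p s off with
    | None -> None
    | Some (x, off') -> f x s off'

let u8 : int parser = fun s off ->
  if off < String.length s then Some (Char.code s.[off], off + 1) else None

let take (n : int) : string parser = fun s off ->
  if n >= 0 && off + n <= String.length s
  then Some (String.sub s off n, off + n)
  else None

(* Tag byte, length byte, then as many payload bytes as the length says. *)
let field : (int * string) parser =
  let* tag = u8 in
  let* len = u8 in
  let* payload = take len in
  return (tag, payload)

let result = field "\x01\x03abcXYZ" 0
```

The length read by one parser decides how much the next one consumes, which is exactly the "read a length field in order to know how much to read next" case from the original question.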


This is handled in the Cf_decoder module in my Orsetto project. Its CBOR decoder uses it.