[ANN] ocaml-wire: a Binary wire format DSL with EverParse 3D output

I’m happy to announce the release of ocaml-wire 0.9.0 to opam!

ocaml-wire is a set of combinators to describe binary protocols in OCaml. “Binary” as in fixed-layout: sizes and shapes are mostly known up front, so you’re reading fields at known offsets rather than parsing a free-form grammar (like in Menhir). There are already many good binary parser frameworks in OCaml – here’s what’s different in ocaml-wire:

  • The central data structure is EverParse’s 3D grammar. The OCaml combinators describe a 3D structure. Following Daniel’s tagged final encoding approach, you build a codec that describes a struct and how to serialise/deserialise it from an OCaml value.
  • Which means every description can compile down to a .3d file and from there to C via EverParse – and the generated C compiles cleanly with -std=c99 -Wall -Werror. So you could, in theory (I haven’t pushed on this too far for now), build and link a complex stack as an standalone, external C parser whose memory safety, single-pass behaviour, and conformance to the 3D spec are proven in F*(EverParse doesn’t give you serialisers, unfortunately).
  • If you prefer, you can still use the OCaml parsers/verifiers, which have been made reasonably fast (i.e. no extra allocations on the hot path), and as they are streaming codecs you can just read or write a single field of the structure without parsing the whole thing. They also compile trivially to JavaScript via js_of_ocaml, if you need to parse the same format in the browser.

A small example – a packet with two nibble-wide fields, a big-endian length, and a variable-size payload whose length is read from the Length field:

open Wire

type packet = { version : int; flags : int; length : int; payload : string }

let f_version = Field.v "Version" (bits ~width:4 U8)
let f_flags   = Field.v "Flags"   (bits ~width:4 U8)
let f_length  = Field.v "Length"  uint16be
let f_payload = Field.v "Payload" (byte_array ~size:(Field.ref f_length))

let codec =
  let open Codec in
  v "Packet" (fun version flags length payload -> { version; flags; length; payload })
    [ f_version $ (fun p -> p.version);
      f_flags   $ (fun p -> p.flags);
      f_length  $ (fun p -> p.length);
      f_payload $ (fun p -> p.payload) ]

Field.ref f_length is the dependent-size bit: the payload’s length comes from the f_length field read earlier in the same struct. That same codec value is what you hand to Everparse.schema to get the 3D file out, and what backs Codec.get / Codec.set for streaming field access from OCaml. ocaml-wire also ships the infrastructure for generating FFI stubs and for differential testing to make sure the verified C and hand-written OCaml parsers agree. So far they seem to :slight_smile:

You can read more about it here: Thomas Gazagnaire :: Describing Binary Formats in OCaml.

To try it: opam install wire

Feedback very welcome on the issue tracker (or bellow)!

14 Likes

I am not familiar with the 3D formalism. What stands out is that all integers are unsigned, which reflect a bit-level view. But how can these be interpreted as signed integers or floats? Is this left to the layers above?

Yes, this is left to the layer above. The split in ocaml-wire today:

  • Generated C (EverParse): only sees U8, U16(BE), U32(BE), U64(BE). Checks bounds, tags, and length relations. This is the part that gets formally verified, and keeping it unsigned keeps the F* proofs tractable.

  • OCaml combinator layer: exposes exactly what 3D can express: the unsigned primitives, structs, bitfields, tagged unions, and dependent-size arrays. If you need a field as signed int32 or an IEEE-754 float, the conversion lives in the record constructor and per-field projection you hand to Codec.v (Int32.float_of_bits on decode, Int32.bits_of_float on encode for instance).

So the verified parser only sees the unsigned bit patterns. The signed/float interpretation is an OCaml-side convenience that EverParse has no knowledge of, so the F* proof covers memory safety, bounds, tag validity, and length relations, but not “these four bytes are a well-formed float” in any numeric sense. If you do want a semantic constraint verified, you express it at the bit level, as a constraint on the unsigned word. Tag sets and range checks drop out naturally (Field.ref f < int 0x8000_0000, or an enum disjunction); more elaborate predicates like “not a NaN” might be expressible as bitwise constraints on the exponent and mantissa, though I haven’t tried yet…

That might be a good idea to expose those bitwise constraints as first-class combinators in the DSL!