Marshal determinism and stability

Quick notes about this approach:

  • It is used extensively in the Tezos codebase: for data exchange (in the p2p layer), for data at rest (configuration files), and for a mix of the two (serialisation of economic protocol data, which is both exchanged between peers and stored on disk).
  • It is flexible in that you have great control over the representation of data and the serialisation/deserialisation procedure. There is a medium-term plan to allow even more control. For now you can decide, say, if 8 booleans are represented as one byte, 8 bytes, or 8 words (or something else altogether) (see code below).
  • Some of the responsibility for correctness rests upon your shoulders as a user. E.g., when you encode a tuple, the left element must have either a fixed length (e.g., be an int8, int32, etc., be a fixed-length string, or be a tuple of fixed-length values) or be prefixed by a length marker (which the library provides a combinator for). Most of the errors for this are raised when you declare the encoding and a few are raised when you use the encoding. I recommend writing some tests to check that your encodings accept the range of values that you are going to throw at them.
  • The library is well tested: there are tests using crowbar to check that encoding and decoding are actual inverses of each other.
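To illustrate why the left element of a tuple needs a fixed length or a length marker, here is a small stdlib-only sketch. The names `encode_pair` and `decode_pair` are hypothetical helpers written for this post; they are not the library's API and not how the library implements its length-prefix combinator. The 4-byte big-endian prefix is an arbitrary choice for the example.

```ocaml
(* Naive concatenation of two variable-length strings is ambiguous:
   "ab" ^ "c" and "a" ^ "bc" produce the same bytes, so a decoder
   cannot tell where the left element ends. *)
let ambiguous = "ab" ^ "c" = "a" ^ "bc"

(* Hypothetical helper: write the left string's length as a 4-byte
   big-endian prefix, then the left string, then the right string. *)
let encode_pair (left : string) (right : string) : string =
  let n = String.length left in
  let m = String.length right in
  let b = Bytes.create (4 + n + m) in
  Bytes.set_int32_be b 0 (Int32.of_int n);
  Bytes.blit_string left 0 b 4 n;
  Bytes.blit_string right 0 b (4 + n) m;
  Bytes.to_string b

(* Hypothetical helper: read the prefix to find where the left string
   ends; everything after it is the right string. *)
let decode_pair (s : string) : string * string =
  let b = Bytes.of_string s in
  let n = Int32.to_int (Bytes.get_int32_be b 0) in
  let left = Bytes.sub_string b 4 n in
  let right = Bytes.sub_string b (4 + n) (Bytes.length b - 4 - n) in
  (left, right)
```

With the prefix in place, `encode_pair "ab" "c"` and `encode_pair "a" "bc"` no longer collide, and `decode_pair` recovers the original pair. This is also the kind of round-trip property worth putting in your own tests.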

Let me know if you have more questions. And in the meantime, here are two different encodings for a tuple of 8 booleans:

open Data_encoding

(* easy encoding, produces 8 bytes *)
let boolsas8bytes =
   tup8 bool bool bool bool bool bool bool bool

(* very-compact encoding, produces 1 byte *)
let boolsas1byte =
   conv
      (fun (b1, b2, b3, b4, b5, b6, b7, b8) ->
         let acc = 0 in
         let acc = if b1 then acc lor 0b10000000 else acc in
         let acc = if b2 then acc lor 0b01000000 else acc in
         let acc = if b3 then acc lor 0b00100000 else acc in
         …
         acc)
      (fun i ->
         let b1 = i land 0b10000000 <> 0 in
         let b2 = i land 0b01000000 <> 0 in
         let b3 = i land 0b00100000 <> 0 in
         …
         (b1, b2, b3, b4, b5, b6, b7, b8))
      uint8
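
For anyone who wants to try the bit-packing on its own, here is a stdlib-only sketch of the two conversion functions. The names `pack`, `unpack`, and `round_trips` are made up for this example, and the five bits elided in the snippet above are assumed to continue the same pattern (b1 as the most significant bit, b8 as the least).

```ocaml
type bools8 = bool * bool * bool * bool * bool * bool * bool * bool

(* Pack 8 booleans into an int in the range 0..255.
   Assumption: b1 is the most significant bit, b8 the least. *)
let pack ((b1, b2, b3, b4, b5, b6, b7, b8) : bools8) : int =
  let bit b mask = if b then mask else 0 in
  bit b1 0b10000000
  lor bit b2 0b01000000
  lor bit b3 0b00100000
  lor bit b4 0b00010000
  lor bit b5 0b00001000
  lor bit b6 0b00000100
  lor bit b7 0b00000010
  lor bit b8 0b00000001

(* Inverse of [pack]: test each bit of the integer in turn. *)
let unpack (i : int) : bools8 =
  let test mask = i land mask <> 0 in
  ( test 0b10000000,
    test 0b01000000,
    test 0b00100000,
    test 0b00010000,
    test 0b00001000,
    test 0b00000100,
    test 0b00000010,
    test 0b00000001 )

(* Exhaustive round-trip check over every value a uint8 can hold. *)
let round_trips () : bool =
  let ok = ref true in
  for i = 0 to 255 do
    if pack (unpack i) <> i then ok := false
  done;
  !ok
```

These two functions have the shapes expected by the first two arguments of `conv` in the snippet above, with `uint8` as the underlying encoding. Because there are only 256 possible values, the round-trip check can be exhaustive rather than randomised.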

In general, data-encoding is probably slower than Marshal, but its strong points are:

  • it offers some type guarantees,
  • it gives you some control over the representation of the data,
  • it allows you to define representations that are easy to parse in other languages or in other versions of the same language,
  • it generates documentation about the data-representation.
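
For contrast, here is a small sketch of the same 8 booleans going through the standard library's Marshal module. The variable names are just for illustration. It shows two of the points above: you do not choose the representation (the output carries a header plus OCaml-specific data, so it is much larger than one byte), and the type of the value you read back comes only from your annotation, which Marshal does not check.

```ocaml
let bools = (true, false, true, false, false, false, false, true)

(* Marshal picks the byte layout; the caller has no say in it. *)
let marshalled : string = Marshal.to_string bools []

(* Total size in bytes: a fixed header followed by the data part. *)
let size : int = String.length marshalled

(* The annotation below is trusted blindly by [Marshal.from_string].
   A wrong annotation would still type-check and could crash at runtime. *)
let back : bool * bool * bool * bool * bool * bool * bool * bool =
  Marshal.from_string marshalled 0
```

Here the annotation matches what was written, so `back` equals `bools`; nothing in Marshal enforces that match.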