[ANN] First release of data-encoding, JSON and binary serialisation

On behalf of Nomadic Labs, I’m pleased to announce the first public release of data-encoding: a library to encode and decode values to and from JSON or a binary format. data-encoding provides fine-grained control over the representation of the data, documentation generation, and detailed encoding/decoding errors.

In the Tezos project, we use data-encoding for binary serialisation and deserialisation of data transported via the P2P layer and for JSON serialisation and deserialisation of configuration values stored on disk.

The library is available through opam (opam install data-encoding), hosted on GitLab (https://gitlab.com/nomadic-labs/data-encoding), and distributed under the MIT license.

This release was only possible following an effort to refactor our internal tools and libraries. Most of the credit for this effort goes to Pietro Abate and Pierre Boutillier. Additional thanks to Gabriel Scherer who discovered multiple bugs and contributed the original crowbar tests.

Planned future improvements of the library include:

  • splitting the library into smaller components (to minimise dependencies when using only a part of the library), and
  • providing multiple endianness (currently the library only provides big-endian binary encodings).

Is this faster and/or safer than stdlib’s Marshal?

The short answer

I’m not aware of any benchmark comparing it to stdlib’s Marshal. If I had to guess, I’d say it’s probably slower in most cases.

The longer answer

The library is intended to give fine control over the representation of data. To this end, you first assemble a t encoding for the type t you want to serialise. The performance of the serialisation depends heavily on how you use the provided combinators to assemble that encoding. E.g., you can represent all integers in the native machine format (which is probably faster) or, provided your code has the right invariants, you can represent some integers in a fraction of the space (say int8, or int16).

Because the library gives you a lot of control, you can choose where you want to be on the space-vs-time tradeoff. You can include a lot of dynamic checks right in the encoding, or you can leave that responsibility to the caller.
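To make the tradeoff concrete, here is a minimal sketch that uses only the OCaml standard library (4.08 or later), not data-encoding itself. The names encode_int64, encode_int8 and decode_int8 are hypothetical and exist only for illustration. It contrasts a fixed 8-byte big-endian representation with a 1-byte representation that is only valid when the value is known to fit, and it places the dynamic range check inside the encoder.

```ocaml
(* Stdlib-only sketch; none of these names come from data-encoding. *)

(* Native-width representation: always 8 bytes, no range check needed. *)
let encode_int64 (v : int) : Bytes.t =
  let buf = Bytes.create 8 in
  Bytes.set_int64_be buf 0 (Int64.of_int v);
  buf

(* Compact representation: 1 byte, valid only when -128 <= v <= 127.
   The dynamic check lives in the encoder; it could instead be left
   to the caller if the invariant is guaranteed elsewhere. *)
let encode_int8 (v : int) : Bytes.t option =
  if v < -128 || v > 127 then None
  else begin
    let buf = Bytes.create 1 in
    Bytes.set_int8 buf 0 v;
    Some buf
  end

(* Decode the compact form, rejecting buffers of the wrong length. *)
let decode_int8 (buf : Bytes.t) : int option =
  if Bytes.length buf <> 1 then None else Some (Bytes.get_int8 buf 0)
```

The compact form uses an eighth of the space but pays for a comparison on every call and can fail; the wide form never fails but always spends 8 bytes.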


All in all, it’s a pretty different library. You could use combinators to define an encoding that’s compatible with a serialising library written in a different language. You could use a standard compression algorithm as part of an encoding. You could manually align the encoded data on an arbitrary boundary so that network packets split it cleanly and partial decoding works better.

In short: you have more control over both the representation of the data and the process of serialisation/deserialisation, so you can decide to go fast (although I’m not sure you can go as fast as Marshal) or go compact or something else.


@raphael-proust only talked about performance; I haven’t dug very far into data-encoding's implementation, but I assume that it doesn’t carry the risk of segfaults or similar failures, which can happen when using Marshal to deserialize data with a layout that does not match the desired type.

That’s correct. The failure modes of deserialisation are also more informative:

  type read_error =
    | Not_enough_data
    | Extra_bytes
    | No_case_matched
    | Unexpected_tag of int
    | Invalid_size of int
    | Invalid_int of {min : int; v : int; max : int}
    | Invalid_float of {min : float; v : float; max : float}
    | Trailing_zero
    | Size_limit_exceeded
    | List_too_long
    | Array_too_long

  exception Read_error of read_error

  val pp_read_error : Format.formatter -> read_error -> unit

Note also that the library includes some safe versions of deserialisation functions that return option values.

  (** [of_bytes enc buf] is equivalent to [read enc buf 0 (length buf)].
      The function returns [None] if the buffer is not fully read. *)
  val of_bytes : 'a Encoding.t -> Bytes.t -> 'a option

  (** [of_bytes_exn enc buf] is equivalent to [of_bytes], except
      @raise [Read_error] instead of returning [None] in case of error. *)
  val of_bytes_exn : 'a Encoding.t -> Bytes.t -> 'a

Note however that it is possible for users of the library to create encodings that can fail with a different exception. Specifically, because users have control over the deserialisation process, they are free to raise exceptions when appropriate to their own deserialisation process.

I wonder why not have val of_bytes : 'a Encoding.t -> Bytes.t -> ('a, read_error) result since you already spent the time to construct these nice errors!? (There’s Result.to_option in the stdlib, but the other way around doesn’t work: once you have None you can’t recover the read_error.)

This is due to a series of historical design decisions. Mostly it has to do with the interplay between the error monad and the data-encoding library in the original project:

Error management in the Tezos project uses an error monad (not yet released, but possibly some time in the future). It turned out that having encodings for errors was more useful than having errors for encodings, so the error monad depends on the data-encoding library but not vice versa. As a result, the data-encoding library uses very bare-bones error management.


I’ll add this feature to the data-encoding library. And I’ll include it in the v0.2 release.
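In the meantime, a caller who wants the error value could wrap a raising decoder themselves. The following is a minimal, stdlib-only sketch: the read_error type here is a two-constructor stand-in for the library’s type, and int16_of_bytes_exn is a hypothetical decoder playing the role of of_bytes_exn applied to some encoding. Neither is the library’s actual code.

```ocaml
(* Stand-in for a subset of the library's error type and exception. *)
type read_error = Not_enough_data | Extra_bytes

exception Read_error of read_error

(* Hypothetical decoder standing in for [of_bytes_exn enc]: it expects
   exactly two bytes holding a big-endian signed 16-bit integer. *)
let int16_of_bytes_exn (buf : Bytes.t) : int =
  let len = Bytes.length buf in
  if len < 2 then raise (Read_error Not_enough_data)
  else if len > 2 then raise (Read_error Extra_bytes)
  else Bytes.get_int16_be buf 0

(* Wrap a raising decoder so the error is returned instead of raised.
   Only [Read_error] is caught; any other exception still propagates. *)
let to_result (decode : Bytes.t -> 'a) (buf : Bytes.t)
    : ('a, read_error) result =
  match decode buf with
  | v -> Ok v
  | exception Read_error e -> Error e

(* The option form can be derived from the result form, not the reverse. *)
let to_option decode buf = Result.to_option (to_result decode buf)
```

Because to_result only catches Read_error, an encoding that raises a user-defined exception would still need its own handler.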
