[ANN] First release of data-encoding, JSON and binary serialisation

On behalf of Nomadic Labs, I’m pleased to announce the first public release of data-encoding: a library to encode and decode values to and from JSON or a binary format. data-encoding provides fine-grained control over the representation of the data, documentation generation, and detailed encoding/decoding errors.
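
To give a rough idea of what using the library looks like, here is a minimal sketch (not taken from the announcement: the user record and its field names are made up for illustration, and it assumes the Data_encoding combinators conv, obj2, req, uint8 and string together with the Json and Binary modules):

```ocaml
open Data_encoding

type user = { name : string; age : int }

(* Describe the JSON/binary representation with combinators.  The fixed-size
   field is placed first and the variable-size string last so the binary
   layout stays unambiguous. *)
let user_encoding : user encoding =
  conv
    (fun { name; age } -> (age, name))
    (fun (age, name) -> { name; age })
    (obj2 (req "age" uint8) (req "name" string))

let () =
  let u = { name = "alice"; age = 30 } in
  (* JSON serialisation *)
  print_endline (Json.to_string (Json.construct user_encoding u)) ;
  (* Binary serialisation *)
  let b = Binary.to_bytes_exn user_encoding u in
  Printf.printf "binary size: %d bytes\n" (Bytes.length b)
```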

In the Tezos project, we use data-encoding for binary serialisation and deserialisation of data transported via the P2P layer and for JSON serialisation and deserialisation of configuration values stored on disk.

The library is available through opam (opam install data-encoding), hosted on Gitlab (https://gitlab.com/nomadic-labs/data-encoding), and distributed under the MIT license.

This release was only possible following an effort to refactor our internal tools and libraries. Most of the credit for this effort goes to Pietro Abate and Pierre Boutillier. Additional thanks to Gabriel Scherer who discovered multiple bugs and contributed the original crowbar tests.

Planned future improvements of the library include:

  • splitting the library into smaller components (to minimise dependencies when using only a part of the library), and
  • supporting multiple endiannesses (currently the library only provides big-endian binary encodings).

Is this faster and/or safer than stdlib’s Marshal?

The short answer

I’m not aware of any benchmarks comparing it to the stdlib’s Marshal. If I had to guess, I’d say it’s probably slower in most cases.

The longer answer

The library is intended to give fine control over the representation of data. To this end, you first assemble a value of type t encoding for the type t you want to serialise. The performance of the serialisation depends heavily on how you use the provided combinators to assemble that encoding. For example, you can represent all integers in the native machine format (which is probably faster) or, provided your code maintains the right invariants, you can represent some integers in a fraction of the space (say int8 or int16), as in the sketch below.
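
A small sketch of that choice (the (int * int) type and the values are made up; it assumes the tup2, int31 and int8 combinators and Binary.to_bytes_exn):

```ocaml
open Data_encoding

(* Two encodings for the same OCaml type (int * int). *)

(* Full-width representation: 4 bytes per component in binary. *)
let wide : (int * int) encoding = tup2 int31 int31

(* Compact representation: 1 byte per component, valid only if the caller
   guarantees that both values fit in a signed byte. *)
let compact : (int * int) encoding = tup2 int8 int8

let () =
  let v = (3, -7) in
  Printf.printf "wide: %d bytes, compact: %d bytes\n"
    (Bytes.length (Binary.to_bytes_exn wide v))
    (Bytes.length (Binary.to_bytes_exn compact v))
```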

Because the library gives you a lot of control, you can choose where you want to be on the space-vs-time tradeoff. You can include a lot of dynamic checks right in the encoding, or you can leave that responsibility to the caller.
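
For instance (a sketch; the port example is made up, and it assumes the ranged_int and uint16 combinators):

```ocaml
open Data_encoding

(* The range check lives in the encoding itself: encoding or decoding a value
   outside 0..65535 fails with a descriptive error. *)
let checked_port : int encoding = ranged_int 0 65535

(* Alternatively, use a plain fixed-width integer and make the caller
   responsible for maintaining the invariant. *)
let unchecked_port : int encoding = uint16
```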


All in all, it’s a pretty different library from Marshal. You could use the combinators to define an encoding that’s compatible with a serialisation library written in a different language. You could use a standard compression algorithm as part of an encoding (see the sketch below). You could manually align the encoded data on an arbitrary boundary so that it splits cleanly across network packets and partial decoding works better.
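
As a sketch of the compression idea: the compress and decompress functions below are hypothetical (e.g. from a zlib binding) and replaced by identity stand-ins so the example is self-contained; only the conv and string combinators are from the library.

```ocaml
open Data_encoding

(* Hypothetical compression functions; identity stand-ins here. *)
let compress : string -> string = fun s -> s
let decompress : string -> string = fun s -> s

(* conv applies compress when encoding and decompress when decoding, so the
   compression step becomes part of the encoding itself. *)
let compressed_payload : string encoding = conv compress decompress string
```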

In short: you have more control over both the representation of the data and the process of serialisation/deserialisation, so you can decide to go fast (although I’m not sure you can go as fast as Marshal) or go compact or something else.