[ANN] First release of data-encoding, JSON and binary serialisation

On behalf of Nomadic Labs, I’m pleased to announce the first public release of data-encoding: a library to encode and decode values to JSON or binary format. data-encoding provides fine-grained control over the representation of the data, documentation generation, and detailed encoding/decoding errors.

In the Tezos project, we use data-encoding for binary serialisation and deserialisation of data transported via the P2P layer and for JSON serialisation and deserialisation of configuration values stored on disk.

The library is available through opam (opam install data-encoding), hosted on GitLab (https://gitlab.com/nomadic-labs/data-encoding), and distributed under the MIT license.

This release was only possible following an effort to refactor our internal tools and libraries. Most of the credit for this effort goes to Pietro Abate and Pierre Boutillier. Additional thanks to Gabriel Scherer who discovered multiple bugs and contributed the original crowbar tests.

Planned future improvements of the library include:

  • splitting the library into smaller components (to minimise dependencies when using only a part of the library), and
  • supporting multiple endiannesses (currently the library only provides big-endian binary encodings).

Is this faster and/or safer than stdlib’s Marshal?

The short answer

I’m not aware of any benchmarks comparing it to stdlib’s Marshal. If I had to guess, I’d say it’s probably slower in most cases.

The longer answer

The library is intended to give fine control over the representation of data. To this end, you first assemble a t encoding for the type t you want to serialise. The performance of the serialisation depends heavily on how you use the provided combinators to assemble that encoding. For example, you can represent all integers in the native machine format (which is probably faster) or, provided your code maintains the right invariants, you can represent some integers in a fraction of the space (say, int8 or int16).
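To make the space tradeoff concrete, here is a stdlib-only sketch (deliberately not the data-encoding API, which uses combinators such as int8) of the same value serialised at two different widths. With data-encoding you would make the equivalent choice by picking a different integer combinator when building the encoding:

```ocaml
(* Encode an int known to fit in 8 bits: 1 byte on the wire. *)
let encode_int8 (v : int) : bytes =
  assert (v >= -128 && v <= 127);
  let b = Bytes.create 1 in
  Bytes.set_int8 b 0 v;
  b

(* Encode the same int in a fixed 64-bit big-endian slot: 8 bytes. *)
let encode_int64_be (v : int) : bytes =
  let b = Bytes.create 8 in
  Bytes.set_int64_be b 0 (Int64.of_int v);
  b

let () =
  Printf.printf "int8: %d bytes, int64: %d bytes\n"
    (Bytes.length (encode_int8 42))
    (Bytes.length (encode_int64_be 42))
```

The invariant check (here a plain assert) is exactly the kind of dynamic check that, with data-encoding, you can choose to bake into the encoding or leave to the caller.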

Because the library gives you a lot of control, you can choose where you want to be on the space-vs-time tradeoff. You can include a lot of dynamic checks right in the encoding, or you can leave that responsibility to the caller.

All in all, it’s a pretty different library. You could use combinators to define an encoding that’s compatible with a serialisation library written in a different language. You could use a standard compression algorithm as part of an encoding. You could manually align the encoded data on an arbitrary boundary so that it splits cleanly across network packets and partial decoding works better.

In short: you have more control over both the representation of the data and the process of serialisation/deserialisation, so you can decide to go fast (although I’m not sure you can go as fast as Marshal) or go compact or something else.


@raphael-proust only talked about performance; I haven’t dug very far into data-encoding's implementation, but I assume that it doesn’t carry the risk of segfaults or similar failures, which can happen when using Marshal to deserialize data with a layout that does not match the desired type.

That’s correct. And the failure modes of deserialisation are more informative:

  type read_error =
    | Not_enough_data
    | Extra_bytes
    | No_case_matched
    | Unexpected_tag of int
    | Invalid_size of int
    | Invalid_int of {min : int; v : int; max : int}
    | Invalid_float of {min : float; v : float; max : float}
    | Trailing_zero
    | Size_limit_exceeded
    | List_too_long
    | Array_too_long

  exception Read_error of read_error

  val pp_read_error : Format.formatter -> read_error -> unit

Note also that the library includes some safe versions of deserialisation functions that return option values.

  (** [of_bytes enc buf] is equivalent to [read enc buf 0 (length buf)].
      The function returns [None] if the buffer is not fully read. *)
  val of_bytes : 'a Encoding.t -> Bytes.t -> 'a option

  (** [of_bytes_exn enc buf] is equivalent to [of_bytes enc buf], except
      that it raises [Read_error] instead of returning [None] in case of
      error. *)
  val of_bytes_exn : 'a Encoding.t -> Bytes.t -> 'a

Note, however, that it is possible for users of the library to create encodings that can fail with a different exception. Specifically, because users have control over the deserialisation process, they are free to raise exceptions where appropriate.

I wonder why you don’t have val of_bytes : 'a Encoding.t -> Bytes.t -> ('a, read_error) result since you already spent the time constructing these nice errors!? (And there’s Result.to_option in the stdlib, but not the other way around: once you have None you can’t recover the read_error.)
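For illustration, here is a stdlib-only sketch of that asymmetry (the read_error constructors and the decoder below are mimicked for the example, not taken from data-encoding): Result.to_option discards the error payload, and there is no inverse that recovers it.

```ocaml
(* A toy decoder that reads exactly one big-endian uint16. *)
type read_error = Not_enough_data | Extra_bytes

let decode_exactly_two (b : bytes) : (int, read_error) result =
  match Bytes.length b with
  | n when n < 2 -> Error Not_enough_data
  | 2 -> Ok (Bytes.get_uint16_be b 0)
  | _ -> Error Extra_bytes

let () =
  (* [Result.to_option] (stdlib, 4.08+) forgets which error occurred. *)
  match Result.to_option (decode_exactly_two (Bytes.create 1)) with
  | None -> print_endline "failed, but we no longer know why"
  | Some v -> Printf.printf "decoded %d\n" v
```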

This is due to a series of historical design decisions. Mostly it has to do with the interplay between the error monad and the data-encoding library in the original project:

Error management in the Tezos project uses an error monad (not yet released, but possibly some time in the future). And it turns out that having encodings for errors was more useful than having errors for encodings: so the error monad depends on the data-encoding library, but not vice versa. As a result, data-encoding itself used very bare-bones error management.

I’ll add this feature to the data-encoding library. And I’ll include it in the v0.2 release.


The newly released version (0.2) addresses this. All the binary reading/writing primitives use result by default and have _opt and _exn variants.

The JSON primitives are not yet changed because they rely on an external library that has more idiosyncratic error management. (This will eventually be fixed in a future version.)


Is this lib okay to use with js_of_ocaml?

If I remember correctly, there was some issue with the zarith dependency (which is used for arbitrary precision integers), but this may have been fixed now. I’ll try to check and let you know.

It should be possible to get it to work using the zarith_stubs_js package for support. But I don’t know if it would work out-of-the-box. I’ll open an issue on the tracker and I’ll try to get js-support packaged for the next release.

If you try it and manage to get it to work (and even if you don’t), let me know!

While discussing data serialization/deserialization with Marshal in Marshal determinism and stability, I came to know that Marshal does not guarantee that serialized data works across different OCaml versions. In particular I am looking at the binary encoding/decoding of this package. Does data-encoding support serializing/deserializing among different OCaml versions?

The serialisation format of data is meant to depend exclusively on your definition of the encoding (the 'a encoding value that you produce) and not on the version of the language that is used. In other words: yes, you can serialise/deserialise among different OCaml versions.
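To illustrate why this holds, here is a stdlib-only sketch (not the data-encoding API) of a hand-specified wire format: a hypothetical point record laid out as two big-endian int16s. The bytes produced depend only on the layout you wrote down, not on the compiler version, so any OCaml version (or any other language) that implements the same layout can decode them.

```ocaml
type point = { x : int; y : int }

(* Wire format: x as big-endian int16 at offset 0, y at offset 2. *)
let encode_point { x; y } =
  let b = Bytes.create 4 in
  Bytes.set_int16_be b 0 x;
  Bytes.set_int16_be b 2 y;
  b

let decode_point b =
  { x = Bytes.get_int16_be b 0; y = Bytes.get_int16_be b 2 }

let () =
  let p = { x = 3; y = -7 } in
  assert (decode_point (encode_point p) = p)
```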

This is a very important property of data-encoding's use in Tezos because it is vital that newer versions of the binaries are able to decode data produced by older versions of the binaries. We have some regression tests specifically designed to catch changes in the binary representation of existing data.

Note, however, that the Tezos project only bumps its OCaml dependency from time to time. As a result, the aforementioned regression tests have not been run on the most recent releases of OCaml.

Also note that the data-encoding library is only available for a given range of OCaml versions (currently >= 4.08). It should be possible to add support for some older versions (if it just means adding a stdlib forward compatibility shim) but not all (some dependencies of data-encoding such as zarith might have hard limits).


You might be interested in version 0.3 of the library, and specifically in the performance improvements that shipped in it: [ANN] data-encoding.0.3: performances and streaming - #2 by raphael-proust