Guess what?! Version 0.3 of data-encoding is now available!
Data-encoding is a library for defining encodings, which can then be used to
serialise and deserialise values to and from a binary or JSON representation.
```ocaml
(* Assuming [open Data_encoding]. The field name "values" is added here:
   [obj2] takes two field encodings, and the second was missing. *)
let e : (string * int list) t =
  obj2 (req "name" string) (req "values" (list uint8))
let v : string * int list =
  ("low", [0; 1; 2; 3; 4])
let s : string = Binary.to_string_exn e v
let j : Json.t = Json.construct e v
```
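The same encoding also drives the reverse direction. A minimal round-trip sketch, assuming `open Data_encoding` as above and using `Binary.of_string_exn` and `Json.destruct` as the decoding counterparts:

```ocaml
(* Decode the binary string and the JSON value produced above. *)
let v' : string * int list = Binary.of_string_exn e s
let v'' : string * int list = Json.destruct e j
(* Both should recover the original pair ("low", [0; 1; 2; 3; 4]). *)
```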
In addition to numerous miscellaneous improvements, this version brings two
major changes.
## Support for streamed JSON serialisation
JSON serialisation of large values can be expensive, and in some contexts
prohibitively so. One such context is cooperative concurrency, where
serialising a very large JSON value can block other ongoing tasks.
Data-encoding now provides the necessary functions to serialise values as
sequences of strings, which can be consumed with appropriate yielding.
```ocaml
(* Assuming [open Data_encoding], an Lwt output channel [oc], and
   [(>>=)] bound to [Lwt.bind]. *)
let rec write seq =
  match seq () with
  | Seq.Nil -> Lwt.return_unit
  | Seq.Cons (chunk, seq) ->
      Lwt_io.write oc chunk >>= fun () ->
      (* Yield to the Lwt scheduler between chunks so other tasks can run. *)
      Lwt.pause () >>= fun () ->
      write seq
in
let j = Json.construct_seq e v in
let s = Json.string_seq_of_jsonm_lexeme_seq ~chunk_size_hint:512 j in
write s
```
## Performance improvements
The serialisation and deserialisation of some encodings have been optimised
significantly. On some encodings, performance is now competitive with Marshal.
According to the micro-benchmark below, data-encoding.0.3 gets close to Marshal performance when serialising and deserialising Micheline values (S-expression-like values used to represent smart contracts on the Tezos blockchain).
The results are printed below. Notice the speed-up:

- for serialising, it progressed from a 13.40× slow-down over Marshal to a 1.01× slow-down,
- for deserialising, it progressed from an 18.72× slow-down over Marshal to a 1.02× slow-down.
Maybe not so related, but I cannot resist mentioning it.
There was a paper about adding type-safety to Marshal.
It looked pretty cool:
“Typing Unmarshalling without Marshalling Types”
Henry, G., Mauny, M., Chailloux, E., & Manoury, P. (2012). Typing unmarshalling without marshalling types. ACM SIGPLAN Notices, 47(9), 287–298. http://michel.mauny.net/data/papers/henry-mauny-chailloux-manoury-2012.pdf
I would be curious to know how you got these performance improvements. Are there some big ideas that delivered large improvements? Do you have references/pointers?
I’ll let @yurug answer more specific follow-up questions because he made that happen, but in short, the perf-only changes, transparent to the end-user, are:
- Replace linear-access lists by random-access arrays where possible (this required a bit more than just changing types).
- Memoise the application of `mu` to avoid recomputing the encoding during construction/destruction.
- Use an unboxed `uint option` where `None` is represented as `-1` for some internal counting.
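The last point is a classic unboxing trick: an `int option` allocates a `Some` block on every update, whereas a plain immediate `int` with a sentinel does not. A stdlib-only sketch (the names here are illustrative, not data-encoding’s actual internals):

```ocaml
(* A counter that would naturally be an [int option]: absent before the
   first element, [Some n] afterwards. We encode the absent case as [-1]
   in a plain [int], avoiding a [Some] allocation on every update. *)
let none = -1
let is_none c = c = none
let incr_count c = if is_none c then 1 else c + 1

(* Count chunks without ever boxing the counter. *)
let count_chunks chunks =
  List.fold_left (fun c _chunk -> incr_count c) none chunks
```

This only works because the counted values are non-negative, so `-1` can never collide with a real count.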
And then we also introduced a new combinator for sums (called `matching`) where you pass a function that performs the matching rather than relying on a list of cases. The previous combinator (`union`) is still available, but the new one should be preferred for big case lists where encoding performance is a concern.
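To illustrate the difference, here is a sketch of `matching` based on my reading of the API (the `shape` type and tag numbers are invented for the example; exact argument names may differ): the first argument is a function that selects the case directly when serialising, so there is no linear scan over the case list, while the case list itself is still used for deserialising.

```ocaml
open Data_encoding

(* Hypothetical sum type, for illustration only. *)
type shape = Circle of int | Square of int

let shape_encoding =
  matching
    (* Serialisation side: a direct function, no linear scan of cases. *)
    (function
      | Circle r -> matched 0 int31 r
      | Square s -> matched 1 int31 s)
    (* Deserialisation side: the case list, as with [union]. *)
    [ case ~title:"circle" (Tag 0) int31
        (function Circle r -> Some r | _ -> None)
        (fun r -> Circle r);
      case ~title:"square" (Tag 1) int31
        (function Square s -> Some s | _ -> None)
        (fun s -> Square s) ]
```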
Note that, as mentioned above, the micro-benchmark was run specifically on Micheline values. We have not benchmarked other scenarios; we were primarily interested in optimising the construction and destruction of these values. If you have specific examples of encodings that are slow, especially encodings that take more than 10× the time of Marshal, we’d be very happy to hear about them!