Seq vs List, optimization

A bit of a spoiler for an upcoming release of a few of our libraries at Nomadic Labs…

We had a bug report: calls to some RPCs exposed by some of our binaries would occasionally cause some lag. One of the root causes of the issue was JSON serialisation. The original serialisation scheme was intended for a limited range of uses (in particular, small values), but it was then used outside of this intended range: some relatively big values were serialised and pushed down the RPC stack.

To circumvent this, we are about to release

  • a “json lexeme sequence” backend for our serialiser library: construct_seq : 'a encoding -> 'a -> json_lexeme Seq.t where json_lexeme = Jsonm.lexeme = [ `Null | `Bool of bool | … | `As | `Ae | `Os | `Oe ]
  • a json lexeme sequence to string sequence converter.
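To make the first bullet concrete, here is a hand-built lexeme sequence of the shape `construct_seq` produces. This is only an illustrative sketch (the `json_lexeme` type below is written out from the `Jsonm.lexeme` cases quoted above; it is not the library's code):

```ocaml
(* Sketch: a Jsonm-style lexeme type as quoted above.
   `Os / `Oe open and close an object, `As / `Ae an array. *)
type json_lexeme =
  [ `Null | `Bool of bool | `String of string | `Float of float
  | `Name of string | `As | `Ae | `Os | `Oe ]

(* A hand-built lexeme sequence for the JSON value {"x": true}.
   Seq.t is lazy: no lexeme is produced until the consumer forces it,
   which is what keeps the serialisation incremental. *)
let lexemes : json_lexeme Seq.t =
  List.to_seq [ `Os; `Name "x"; `Bool true; `Oe ]
```

The point of returning a `Seq.t` rather than a list is exactly this laziness: the consumer pulls lexemes one at a time, so the whole serialised value never needs to be materialised at once.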

For this second part, we actually have three different converters intended for slightly different uses. They have different granularity, they have different allocation profiles, and they make slightly different assumptions, most notably about concurrency:

  • string_seq_of_json_lexeme_seq : chunk_size_hint:int -> json_lexeme Seq.t -> string Seq.t which uses one (1) internal buffer of size chunk_size_hint. Consuming one element of the resulting sequence causes several json lexemes to be consumed and printed onto the internal buffer until it is full. When this happens, a snapshot (copy) of the buffer is delivered in the Cons cell. So for a chunk_size_hint of, say, 1kB, the sequence translator uses roughly 1kB of memory and emits 1kB string chunks that the consumer is responsible for.
  • small_string_seq_of_json_lexeme_seq : json_lexeme Seq.t -> string Seq.t which translates each lexeme into a single string. It’s a little bit more than a simple Seq.map because it needs to insert separators and escape strings. It mostly returns statically allocated strings, so there are no big allocations at all.
  • blit_instructions_seq_of_jsonm_lexeme_seq : buffer: bytes -> json_lexeme Seq.t -> (bytes * int * int) Seq.t which works somewhat similarly to the first one but uses buffer instead of allocating its own. It returns a seq of (source, offset, length) triples which are intended to be blitted onto whatever the consumer wants to propagate the data to. This barely allocates at all (it currently does allocate relatively big chunks when escaping strings, but we plan to improve this in the future). The sequence returns a source to blit; this source is physically equal to buffer most of the time, but not always: for large strings present within the json data, the sequence just points to them directly as the source.
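A toy version of the per-lexeme converter shows why it is “a little bit more than a simple Seq.map”: comma insertion needs state threaded through the sequence. This sketch is not the library’s implementation (in particular, it does no string escaping):

```ocaml
type json_lexeme =
  [ `Null | `Bool of bool | `String of string | `Float of float
  | `Name of string | `As | `Ae | `Os | `Oe ]

(* Each lexeme maps to one small string; most results ("null", "{", …)
   are constants, so very little is allocated per element. *)
let string_of_lexeme : json_lexeme -> string = function
  | `Null -> "null"
  | `Bool b -> string_of_bool b
  | `Float f -> string_of_float f
  | `String s -> "\"" ^ s ^ "\""     (* toy: no escaping *)
  | `Name n -> "\"" ^ n ^ "\":"
  | `As -> "[" | `Ae -> "]"
  | `Os -> "{" | `Oe -> "}"

(* The extra state: [comma] records whether a complete value was just
   emitted, so the next member/element must be preceded by ",". *)
let strings_of_lexemes (lxs : json_lexeme Seq.t) : string Seq.t =
  let rec go comma lxs () =
    match lxs () with
    | Seq.Nil -> Seq.Nil
    | Seq.Cons (lx, rest) ->
      let closer = match lx with `Ae | `Oe -> true | _ -> false in
      (* After `Os, `As or a name, the next lexeme starts fresh;
         after any completed value, a separator may be needed. *)
      let comma' = match lx with `Os | `As | `Name _ -> false | _ -> true in
      let tail () = Seq.Cons (string_of_lexeme lx, go comma' rest) in
      if comma && not closer then Seq.Cons (",", tail) else tail ()
  in
  go false lxs
```

Because the result is itself a lazy `Seq.t`, the consumer can interleave pulling small strings with whatever I/O it is doing, which is the property the RPC stack needs.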

Note that the description above is a simplification: there is a bit more to it than that. Also note that all this is still Work In Progress. Check out Construct seq (!5) · Merge requests · Nomadic Labs / json-data-encoding · GitLab (the value to json lexeme sequence code) and json streaming (!19) · Merge requests · Nomadic Labs / data-encoding · GitLab (the json lexeme sequence to string sequence code).
