[ANN] Jsont 0.1.0 – Declarative JSON data manipulation for OCaml

Hello,

It’s my pleasure to announce the first release of the jsont library:

Jsont is an OCaml library for declarative JSON data manipulation. It provides:

  • Combinators for describing JSON data using the OCaml values of your choice. The descriptions can be used by generic functions to decode, encode, query and update JSON data without having to construct a generic JSON representation.
  • A JSON codec with optional text location tracking and layout preservation. The codec is compatible with effect-based concurrency.

The descriptions are independent from the codec and can be used by third-party processors or codecs.
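
To give a quick taste, here is a small illustrative description of a JSON object mapped to an OCaml record (just a sketch; see the docs for details and for the exact codec entry points, assumed below to be Jsont_bytesrw.decode_string and Jsont_bytesrw.encode_string):

type person = { name : string; age : int }

let person_jsont : person Jsont.t =
  Jsont.Object.map ~kind:"Person" (fun name age -> { name; age })
  |> Jsont.Object.mem "name" Jsont.string ~enc:(fun p -> p.name)
  |> Jsont.Object.mem "age" Jsont.int ~enc:(fun p -> p.age)
  |> Jsont.Object.finish

(* The same description drives both directions. *)
let person_of_json s = Jsont_bytesrw.decode_string person_jsont s
let person_to_json p = Jsont_bytesrw.encode_string person_jsont p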

Jsont is distributed under the ISC license. It has no dependencies. The codec is optional and depends on the bytesrw library. The JavaScript support is optional and depends on the brr library.

The library has been used in practice, but it is new, so a few adjustments may still be needed and more convenience combinators added.

The library also enables quite a few things that I did not have the time to explore, like schema description generation from descriptions, quasi-streaming JSON transformations, description generation from dynamic type representations, etc. Lots of this can be done outside the core library; do not hesitate to get in touch if you use the library and find interesting applications or pesky limitations.

Homepage: https://erratique.ch/software/jsont
Docs: https://erratique.ch/software/jsont/doc (or odig doc jsont)
Install: opam install jsont bytesrw

This first release was made possible thanks to a grant from the OCaml Software Foundation. I also thank my donors for their support.

Best,

Daniel

P.S. I think that the technique used by the library, which I dubbed finally tagged, is interesting in itself. You can read a paper about it here, along with a smaller, self-contained implementation of what the library does.


Since programmers are always curious about performance, and also a bit irrational about it :–), I just want to throw in a few numbers to nourish these irrationalities and convince you that, despite invoking bazillions of functions, jsont remains competitive. Don’t take the numbers below too seriously, except for ruling out that jsont is incredibly slow. In practice performance profiles are bound to be quite data dependent (floating point parsing, character data beyond ASCII, out-of-order case members, etc.).

I benchmarked a few tools that decode and minify JSON on a particular 78MB file of GeoJSON data that I found online. We try to keep things comparable among the tools (e.g. w.r.t. source layout and location tracking) but it’s a bit difficult; for example Yojson is notorious for not checking UTF-8 validity on input (an unforgivable sin). This is measured on an ARM64 M2 with 16GB of memory and OCaml 5.2.0.

We compare json_xs (Perl with C bindings), jq, ydump (distributed with yojson), jsontrip (distributed with jsonm), jsont (distributed with jsont) and geojson, which is a direct modelling of GeoJSON with the jsont library (i.e. it codecs GeoJSON without going through a generic JSON representation):

Benchmark 1: json_xs -t json < tmp/parcels.json
  Time (mean ± σ):      1.344 s ±  0.008 s    [User: 1.244 s, System: 0.093 s]
  Range (min … max):    1.334 s …  1.359 s    10 runs
 
Benchmark 1: jq -c . tmp/parcels.json
  Time (mean ± σ):      1.930 s ±  0.015 s    [User: 1.780 s, System: 0.145 s]
  Range (min … max):    1.918 s …  1.965 s    10 runs
  
Benchmark 1: ydump -std -c tmp/parcels.json
  Time (mean ± σ):      3.647 s ±  0.013 s    [User: 3.529 s, System: 0.112 s]
  Range (min … max):    3.630 s …  3.677 s    10 runs
 
Benchmark 1: jsontrip tmp/parcels.json
  Time (mean ± σ):      3.059 s ±  0.009 s    [User: 3.013 s, System: 0.045 s]
  Range (min … max):    3.041 s …  3.075 s    10 runs

Benchmark 1: jsont fmt -fminify tmp/parcels.json
  Time (mean ± σ):      2.175 s ±  0.006 s    [User: 2.097 s, System: 0.073 s]
  Range (min … max):    2.168 s …  2.189 s    10 runs

Benchmark 1: geojson tmp/parcels.json
  Time (mean ± σ):      1.846 s ±  0.003 s    [User: 1.798 s, System: 0.044 s]
  Range (min … max):    1.843 s …  1.851 s    10 runs

Note that on encoding the bottleneck is formatting floating point numbers, with which that data file is littered. So the difference between using jsont fmt, which goes through a generic representation, and geojson, which directly models GeoJSON, gets a bit lost. The following shows decoding only, for the tools that support it:

Benchmark 1: json_xs -t none < tmp/parcels.json
  Time (mean ± σ):     440.7 ms ±   2.4 ms    [User: 379.7 ms, System: 54.9 ms]
  Range (min … max):   437.8 ms … 445.7 ms    10 runs
 
Benchmark 1: jsontrip -dec tmp/parcels.json
  Time (mean ± σ):      1.557 s ±  0.003 s    [User: 1.529 s, System: 0.027 s]
  Range (min … max):    1.553 s …  1.561 s    10 runs

Benchmark 1: jsont fmt -d tmp/parcels.json
  Time (mean ± σ):      1.100 s ±  0.004 s    [User: 1.039 s, System: 0.056 s]
  Range (min … max):    1.095 s …  1.107 s    10 runs
  
Benchmark 1: geojson -d tmp/parcels.json
  Time (mean ± σ):     798.0 ms ±   1.5 ms    [User: 766.8 ms, System: 28.6 ms]
  Range (min … max):   796.1 ms … 800.3 ms    10 runs

Finally I’d like to say something about usability and then I will shut up :–)

If you program in a language like JavaScript, knowing that you will get data as JSON is always a relief: it means no work to get the data in and out. At least so you think, until you realize at runtime that the data producer is not, or no longer, producing exactly what it told you it would.

So far I did not enjoy the same relief when I knew I’d have to deal with JSON in my OCaml programs. Bringing that relief is one of the goals of jsont.

Using jsont still entails more work than in JavaScript: the descriptions (or queries) have to be written. But that extra work lets you work with natural OCaml datatype definitions and, when producers start lying, you get nice error messages with locations, like these (here for the GeoJSON modelling mentioned before):

Error: Unexpected enum string value: Tapology. Should it be Topology ?
       File "tmp/topology.json", line 2, characters 10-20:
       File "tmp/topology.json", line 2, characters 2-8: in member type of
       File "tmp/topology.json", lines 1-2, characters 0-20: Topology object
       
Error: Unexpected member type value in Geometry object: Curve. Must be Point,
       MultiPoint, LineString, MultiLineString, Polygon, MultiPolygon or
       GeometryCollection.
       File "tmp/topology.json", line 7, characters 21-22:
       File "tmp/topology.json", line 7, characters 6-12: in member type of
       File "tmp/topology.json", lines 6-7, characters 15-22: Geometry object
       File "tmp/topology.json", line 6, characters 4-13: in member example of
       File "tmp/topology.json", lines 5-7, characters 13-22: objects map object
       File "tmp/topology.json", line 5, characters 2-11: in member objects of
       File "tmp/topology.json", lines 1-7, characters 0-22: Topology object

Error: Missing member coordinates in Point object
       File "tmp/topology.json", lines 9-16, characters 8-9:
       File "tmp/topology.json", lines 9-16, characters 8-9: at index 0 of
       File "tmp/topology.json", lines 8-16, characters 20-9: array<Geometry object>
       File "tmp/topology.json", line 8, characters 6-18: in member geometries of
       File "tmp/topology.json", lines 6-16, characters 15-9: GeometryCollection object
       File "tmp/topology.json", line 6, characters 4-13: in member example of
       File "tmp/topology.json", lines 5-16, characters 13-9: objects map object
       File "tmp/topology.json", line 5, characters 2-11: in member objects of
       File "tmp/topology.json", lines 1-16, characters 0-9: Topology object

And of course all this happens in OCaml, free of any kind of ppx nonsense. The result is flexible and lightweight to use and works wonders against bit rot.

Do not fear the extra modelling and boilerplateish step!


Nice, I hadn’t heard of jsont before. While I understand not being a fan of the ppx stuff, I do have to admit that [@@deriving yojson] is so simple I can’t help but use it, even though the error messages are atrocious. I’m hoping a ppx_deriving for jsont happens.

Well it’s the first release :–)

Perhaps, but that kind of thing doesn’t really help when dealing with JSON that you don’t control. I always feel it’s not worth the trouble.

That being said, the amazing @art-w came up with something that I had wanted to have in the library before I prioritized solving other problems.

His let operator proposal makes it possible to deal with labelled object constructors (which is less footgun-prone once you start dealing with a lot of fields of the same type; e.g. that’s how @smondet convinced me to add let operators to cmdliner). I will have a closer look and may standardize object construction on his proposal, so if you plan to use the library, maybe stay tuned for 0.2.0 as it may entail a few breaking changes.


Nice. I think let-syntax is great for this too, e.g. [ANN] dream-html & pure-html 3.5.2 - #5 by yawaramin

Isn’t the object creation essentially an applicative?

Yes, and you can also use a monadic bind to add further rules. E.g. (from my library):

let* start_date = required unix_tm "start-date" in
let+ end_date = required (unix_tm ~min:start_date) "end-date" in
...

No. It’s a more complicated structure, because you need the return type of the application when you apply members: member specifications use it to specify the projection function applied to the result of the application when it’s time to encode back to JSON.
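
To give an idea of the shape this takes (an illustrative sketch only, not jsont’s actual definitions): the description of an object whose decoded value has type 'o must carry, for each member, a projection out of 'o in addition to the member’s description, so the structure has two type parameters.

(* Sketch: a JSON object decoded to a value of type 'o, whose constructor
   still expects arguments (the remaining constructor type is 'a). *)
type ('o, 'a) obj_map =
| Ctor : 'a -> ('o, 'a) obj_map                (* the object constructor *)
| Mem : { obj : ('o, 'm -> 'a) obj_map;        (* description built so far *)
          name : string;                       (* JSON member name *)
          dec : 'm Jsont.t;                    (* how to decode the member *)
          enc : 'o -> 'm;                      (* projection used to encode *)
        } -> ('o, 'a) obj_map

A complete description has type ('o, 'o) obj_map: decoding folds over the members to apply the constructor, and encoding walks the same members, applying their enc projections to the final value of type 'o.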

It all started with a simple applicative for decoding generic JSON in memory, but as I wrote here I was frustrated that these applicative decoding specifications would not allow me to encode. Having solved encoding (and the ability to support a few other JSON object codec patterns), I lost the obvious applicative along the way, and my attempts at reframing it as an applicative were not successful.

Now @art-w, with some contortions, managed to reframe it as an applicative. But in my enthusiasm for his proposal I failed to see that he seems to be repeatedly constructing pairs when applying the object constructor, which I’m not really happy with, as it brings quite a few more allocations for object construction that are not there with the API I settled on. I will have a look in the upcoming days at whether we keep the current way or switch to @art-w’s scheme.


I don’t claim domain expertise - I just wanted to share the following:

A parametrized abstraction for a codec typically consists of two parts: a reader and a writer.

Separating these two parts can be beneficial, at least in the private implementation (and perhaps even in the API, I’m not certain).

The reader part, being a producer of 'a, is covariant in its parameter. It can often be a great candidate for being an applicative.

The writer part, being a consumer of 'a, is contravariant in its parameter. It cannot provide a map function; instead, it would provide a contra_map.

module Writer : sig
  type 'a t

  val contra_map : 'a t -> f:('b -> 'a) -> 'b t
end

As such, the combined codec cannot be an applicative. I don’t remember much about this, but I vaguely recall using a profunctor library to help with these kinds of things in the past.
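
For what it’s worth, here is a tiny self-contained sketch of that combined shape (illustrative only, not jsont’s types): pairing the reader and the writer gives a type that is invariant in its parameter, so the only map it admits needs a function in each direction.

module Codec = struct
  (* A codec for values of type 'a carried by a serialized form 's:
     read is covariant in 'a, write is contravariant. *)
  type ('s, 'a) t = { read : 's -> 'a; write : 'a -> 's }

  (* The only mapping available needs both directions. *)
  let conv ~dec ~enc c =
    { read = (fun s -> dec (c.read s)); write = (fun b -> c.write (enc b)) }
end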


Oh yes, obviously. I was so focused on decoding I totally forgot about encoding.

I believe the problem of embedding/projecting between a typed and an untyped language is also explored in this paper, which uses OCaml, too.

I can confirm. I’ve built a bidirectional encoding and decoding library in Haskell to deal with TOML and I used exactly Applicatives and Profunctors, so this approach works, and it’s quite nice!


(Since it is Applicative and Profunctor, I guess that using a profunctor with a strong tensor is good enough…)

I’m quite happy that we have all these terms that we can throw at each other’s heads, but somehow I just prefer to say that in order to decode and encode an arbitrary OCaml value you need a typed constructor function (an injection) and typed accessor functions (a list of projections), and somehow your codec structure should make sure they can be made available when you need them.

Personally I always design from the gut and rediscover the structures rather than trying to think in terms of them. If you like thinking with these things that’s fine; I’m just saying this for people who are unfamiliar with category theory[1] and fear missing something :–)

Also all these high-level structures tell you nothing about how to handle unordered member decoding and data dependent decoding (case objects).


  1. Personally I tried a few times but always quickly drowned in the sea of arrows and gave up. ↩︎


So far I have decided not to do so. The slowdown due to the tuple construction does register, but at least on a simple example it’s not dramatic. I wonder however how much GC pressure that adds in practice (e.g. in a program that spends its time JSONing over the network).
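
(As a generic aside on where those tuples come from: this is not @art-w’s actual code, just the standard desugaring of the applicative let-syntax, in which and+ builds pairs.)

(* Toy decoder applicative, only to show the desugaring of the let-syntax. *)
type 'a dec = string -> 'a
let ( let+ ) (d : 'a dec) (f : 'a -> 'b) : 'b dec = fun s -> f (d s)
let ( and+ ) (da : 'a dec) (db : 'b dec) : ('a * 'b) dec = fun s -> (da s, db s)

(* let+ a = da and+ b = db and+ c = dc in f a b c
   elaborates to
   ( let+ ) (( and+ ) (( and+ ) da db) dc) (fun ((a, b), c) -> f a b c)
   i.e. the nested pairs ((a, b), c) are allocated before the constructor runs. *)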

Also, the current proposal doesn’t fully integrate case objects and unknown member handling, so it feels a tad clunky. I don’t exclude introducing a syntax for this in the future, but my impression is that it needs a bit more work and should not necessarily entail changes to the current API.

In any case, v0.1.1 has been released; it fixes a few maps where both decoding and encoding had to be specified (you are generally allowed to specify only one direction) and fixes a build issue.


I absolutely agree with you! In fact, in my own discoveries, I came up with something, then learned about the underlying theory, and then had to think hard about supporting things like unordered decoding and case objects.

I gave a talk on the subject (it’s in Haskell, but the ideas are translatable):

And I also wrote a blog post:


Yesterday I wrote my first jsont expression to parse a variant type.

type t =
  | Unix of { path : Fpath.t }
  | Tcp of
      { ipaddr : Eio.Net.Ipaddr.v4v6
      ; port : int
      }

let jsont =
  (* This format is used to serialize the sockaddr into the line of a file that
     is read during the service-discovery via file strategy. It is stable and
     should offer good backward compatibility. *)
  let unix =
    Jsont.Object.map ~kind:"Unix" Fpath.v
    |> Jsont.Object.mem "path" Jsont.string ~enc:Fpath.to_string
    |> Jsont.Object.finish
  in
  let tcp =
    Jsont.Object.map ~kind:"Tcp" (fun a b -> a, b)
    |> Jsont.Object.mem "ipaddr" Jsont.string ~enc:fst
    |> Jsont.Object.mem "port" Jsont.int ~enc:snd
    |> Jsont.Object.finish
  in
  let unix_case = Jsont.Object.Case.map "Unix" unix ~dec:(fun path -> Unix { path }) in
  let tcp_case =
    Jsont.Object.Case.map "Tcp" tcp ~dec:(fun (ipaddr, port) ->
      Tcp { ipaddr = Eio.Net.Ipaddr.of_raw ipaddr; port })
  in
  let enc_case = function
    | Unix { path } -> Jsont.Object.Case.value unix_case path
    | Tcp { ipaddr; port } -> Jsont.Object.Case.value tcp_case ((ipaddr :> string), port)
  in
  let cases = Jsont.Object.Case.[ make unix_case; make tcp_case ] in
  Jsont.Object.map ~kind:"Discovery" Fn.id
  |> Jsont.Object.case_mem "type" Jsont.string ~enc:Fn.id ~enc_case cases
  |> Jsont.Object.finish
;;

I’m just dabbling with jsont at this point and haven’t tried 0.1.1 yet.

I feel the same. I first heard about these terms when a colleague pointed me to the library I linked, and I remember it being helpful in that particular situation. I am sorry I cannot be more specifically helpful; it’s just been too many years at this point, and I don’t have the code.

What I remember about the lib I linked is its ability to interoperate with the ppx_fields and ppx_variant preprocessors, which resulted in a potentially desirable sweet spot between a full-ppx solution with little to no ability to customize on one end and a not-so-ergonomic fully manual solution on the other.

Something like this example from the lib.

I wonder if the code generated by ppx_fields and ppx_variant, being specialized to the types in question, allows you to reduce the intermediate tuple allocation happening with the let+ syntax (I genuinely don’t know the answer to that).

Ah, interesting. Inline records, I never tried with those. For reference, if you name your record types it would rather look like the first example here.

That being said, I’m not very happy with your code: it builds intermediate data structures, and it uses Fpath.v, which is a no-go on untrusted input. I think it’s better to build things from small, named parts.

This is the way I would have written it (the assert false in the accessors would go away if you used named records):

let fpath_jsont =
  let of_string s = Result.map_error (fun (`Msg e) -> e) (Fpath.of_string s) in
  Jsont.of_of_string ~kind:"fpath" of_string ~enc:Fpath.to_string

let ipaddr_jsont =
  let dec _meta s = Eio.Net.Ipaddr.of_raw s in
  let enc (ipaddr : Eio.Net.Ipaddr.v4v6) = (ipaddr :> string) in
  Jsont.Base.string (Jsont.Base.map ~kind:"ipaddr" ~dec ~enc ())

type t =
| Unix of { path : Fpath.t }
| Tcp of { ipaddr : Eio.Net.Ipaddr.v4v6; port : int }

let unix path = Unix { path }
let unix_path = function Unix { path } -> path | _ -> assert false
let unix_jsont =
  Jsont.Object.map ~kind:"Unix" unix
  |> Jsont.Object.mem "path" fpath_jsont ~enc:unix_path
  |> Jsont.Object.finish
  
let tcp ipaddr port = Tcp { ipaddr; port }
let tcp_ipaddr = function Tcp { ipaddr; _ } -> ipaddr | _ -> assert false
let tcp_port = function Tcp { port; _ } -> port | _ -> assert false
let tcp_jsont = 
  Jsont.Object.map ~kind:"Tcp" tcp
  |> Jsont.Object.mem "ipaddr" ipaddr_jsont ~enc:tcp_ipaddr
  |> Jsont.Object.mem "port" Jsont.int ~enc:tcp_port
  |> Jsont.Object.finish

let jsont =
  let unix_case = Jsont.Object.Case.map "Unix" unix_jsont ~dec:Fun.id in
  let tcp_case = Jsont.Object.Case.map "Tcp" tcp_jsont ~dec:Fun.id in 
  let enc_case = function
  | Unix _ as v -> Jsont.Object.Case.value unix_case v
  | Tcp _ as v -> Jsont.Object.Case.value tcp_case v
  in
  let cases = Jsont.Object.Case.[ make unix_case; make tcp_case ] in
  Jsont.Object.map ~kind:"Discovery" Fun.id
  |> Jsont.Object.case_mem "type" Jsont.string ~enc:Fun.id ~enc_case cases
  |> Jsont.Object.finish

Also, you may want to use the following instead of Jsont.int; it may save you a few EINVALs:

let tcp_port_jsont =
  let dec meta n =
    let port = int_of_float n in
    if Float.is_integer n && 0 <= port && port <= 65535 then port else
    Jsont.Error.msgf meta "%g: not a TCP port (integer in range [0;65535])" n
  in
  let enc = float_of_int in
  Jsont.Base.number (Jsont.Base.map ~kind:"TCP port" ~dec ~enc ()) 

EDIT: I forgot that there is also Jsont.uint16, but the error message will be less nice.