Combinator library for extracting data for s-exps?

I think we need to add support for YAML and JSON in the stdlib. These are critical wire formats and data types all over the industry, and if OCaml is going to remain relevant, it needs to support them. I hear that TOML is coming on fast, so we oughta support that too.

This, but entirely unironically. YAML is more of a config language, but these days you can’t lift a stone without finding a handful of JSON files beneath it. Not having JSON support is like not having UTF-8 support (… which we only just added recently?).

When you write compilers like it’s 2000, it’s fine not to have JSON. But as soon as you touch anything networked, HTTP and JSON are everywhere (and also hex, base64, etc., but I digress). It’s a pity not to have them. Even for compiler-style programs, provers, etc., these days it’s hard not to want to provide JSON diagnostics or metadata; e.g., META files designed today would very likely use that format.

2 Likes

Multiple thoughts:

  1. Have you heard of “jsonnet”? It’s a weird functional language that computes over JSON. Erm, that is, its origins are in a language that computed over protocol buffers, but since protobufs map pretty directly to JSON … Places like Databricks use it to compute the config files for cloud deployments and such.

  2. That is to say, computing over JSON is a big, big, big problem, and “well, write a bunch of code in your favorite programming language” isn’t an answer (as you rightly note) b/c boy howdy, that’ll take forever, and be about as maintainable as “write a bunch of code that accesses btree indexes directly” was at the dawn of relational databases.

We need query and computation languages over JSON for the same reason we needed relational query languages.

  3. YAML ought to be[1] just another syntax for JSON. Most people who use YAML use it as precisely that. And boy howdy, everything you need for JSON, you also need for YAML. B/c the size of these YAML files that are generated by auto-configurators, and that you then have to modify by hand … geeeeeez.

[1] YAML has these weird bits of syntax that most people don’t use, b/c they recognize that different YAML parsers accept different subsets of the language. It’s all a big mess. I came up with my own subset, designed so that one could write a parser in any language that would accept that language on-the-nose, but hey, not like I can convince anybody to use it: GitHub - chetmurthy/yay: YAY Ain't YAML

FYI, for JSON, Tezt has some combinators that I find very nice to use in practice: tezt/lib/JSON.mli · master · Tezos / tezos · GitLab

You write stuff like JSON.(json |-> "people" |=> index |-> "name" |> as_string).

One nice feature is that error messages include precise locations. For instance, assuming json has an origin of "file.json" (the origin is set at parse time), you can get errors like:

file.json, at people: not an array
file.json, at people.[42]: not an object
file.json, at people.[42]: missing field: name
file.json, at people.[42].name: not a string

A typical example to decode a record would be:

let decode_vector json = {
  x = JSON.(json |-> "x" |> as_float);
  y = JSON.(json |-> "y" |> as_float);
}

One drawback is that this approach is quadratic in the number of record fields, though. The alternative is to iterate over the fields one by one, setting mutable values along the way, before packing all those mutable values into a record, but that’s quite annoying to write.
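For illustration, here is a minimal self-contained sketch of that single-pass, mutable-refs alternative. The record type and field representation (a plain assoc list instead of the library's JSON type) are my own stand-ins, not the Tezt API:

```ocaml
(* Hypothetical sketch: walk the fields once, filling refs,
   then pack the refs into a record at the end. *)
type vector = { x : float; y : float }

let decode_vector (fields : (string * float) list) : vector =
  let x = ref None and y = ref None in
  List.iter
    (fun (k, v) ->
      match k with
      | "x" -> x := Some v
      | "y" -> y := Some v
      | _ -> () (* unknown fields are ignored *))
    fields;
  match (!x, !y) with
  | Some x, Some y -> { x; y }
  | _ -> failwith "missing field"
```

This is linear in the number of fields, at the cost of the intermediate refs and the final repacking match, which is exactly the boilerplate the post calls annoying to write.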

I’m sure something like this can be done for s-expressions.

This is one of the reasons you want a query language and not a set of combinators.

I fail to see how a query language would help to convert JSON values to actual OCaml records, do you have an example? Unless the query language comes with a PPX or something like that, in which case the problem would be solved by the PPX itself, not the query language, right?

I was responding to @gasche 's original problem. There, the issue is to construct some cut-down data structure from the original full sexp/JSON. In an earlier comment in this thread, I mentioned that I’d written an OCaml implementation of @stedolan 's jq; so did someone else (Query-json: Re-implemented jq in Reason Native/OCaml). Two thoughts:

  1. this would allow the query-engine to produce an OCaml JSON value
  2. at least when I wrote my interpreter, it was straightforward to imagine how to produce instead a code-generator, which could easily be converted into a PPX.

Now that I think about it, it seems … obvious that we could repurpose jq to solve your problem pretty much on-the-nose. Imagine:

  1. s-expressions of the form ((a b) (c d)...) are treated as JSON dicts.
  2. otherwise, s-expressions of the form (e1 e2 ...) are treated as JSON lists
  3. other cons nodes are errors. Or maybe we invent syntax to do car/cadr
  4. and everything else maps to strings.

This is a simple transformation of JSON to s-expressions, and you could construct the reverse, so that sexp->json->sexp is the identity function (ignoring case #3).
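As a minimal self-contained sketch of rules 1, 2, and 4 above: the sexp and json types here are hand-rolled stand-ins for Sexplib and Yojson (rule 3 doesn't arise because this sexp type has no dotted pairs), and none of this is from the original post:

```ocaml
type sexp = Atom of string | List of sexp list

type json =
  | String of string
  | Arr of json list
  | Obj of (string * json) list

let rec sexp_to_json (s : sexp) : json =
  match s with
  | Atom a -> String a (* rule 4: atoms map to strings *)
  | List items
    when items <> []
         && List.for_all
              (function List [ Atom _; _ ] -> true | _ -> false)
              items ->
      (* rule 1: ((a b) (c d) ...) is treated as a JSON dict *)
      Obj
        (List.map
           (function
             | List [ Atom k; v ] -> (k, sexp_to_json v)
             | _ -> assert false)
           items)
  | List items ->
      (* rule 2: any other (e1 e2 ...) is treated as a JSON list *)
      Arr (List.map sexp_to_json items)
```

The reverse direction is the obvious inverse, so round-tripping sexp → json → sexp is the identity on this fragment.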

Then, one could just reuse JQ to do the querying.

It seems like that’d solve your problem? And since there are two different JQ implementations in OCaml …

This library might be what I was looking for for parsing the kicad schematics file format (for instance here ).

However, sorry to piggyback on this thread, but I don’t see how to parse a list of a sum type, such as this description of a symbol:

(symbol "Conn_Coaxial_0_1"
        (arc (start -1.778 0.508) (end 1.778 0) (radius (at -0.0508 0) (length 1.8034) (angles 163.6 0))
          (stroke (width 0.254)) (fill (type none))
        )
        (arc (start 1.778 0) (end -1.778 -0.508) (radius (at -0.0254 0) (length 1.8034) (angles 0 -163.8))
          (stroke (width 0.254)) (fill (type none))
        )
        (circle (center 0 0) (radius 0.508) (stroke (width 0.2032)) (fill (type none)))
        (polyline
          (pts
            (xy -2.54 0)
            (xy -0.508 0)
          )
          (stroke (width 0)) (fill (type none))
        )
        (polyline
          (pts
            (xy 0 -2.54)
            (xy 0 -1.778)
          )
          (stroke (width 0)) (fill (type none))
        )
      )

The GitLab instance is not open to external users, so its issue system is quite useless.

Thanks.

1 Like

I just wrote a parser for my input data using Decoders_sexplib, and the result works. It’s the only library recommended in the thread that solves my specific problem, so far.

  open Decoders_sexplib.Decode

  let module_deps_decoder =
    let+ for_intf = field "for_intf" (list string)
    and+ for_impl = field "for_impl" (list string)
    in for_intf @ for_impl

  let module_decoder entry_name =
    let+ name = field "name" string
    and+ impl = field "impl" (list string) |> map List.hd
    and+ deps = field "module_deps" module_deps_decoder
    in (entry_name, name, impl, deps)

  let exec_decoder =
    let* entry_name = field "names" (list string) |> map List.hd in
    field "modules" (list (module_decoder entry_name))

  let lib_decoder =
    let* entry_name = field "name" string in
    field "modules" (list (module_decoder entry_name))
    
  let entry_decoder =
    list_filter (
      string |> uncons @@ fun kind ->
      match kind with
      | "executables" -> let+ v = list exec_decoder in Some v
      | "library" -> let+ v = list lib_decoder in Some v
      | _ -> succeed None
    )
    |> map List.flatten
    |> map List.flatten

Note that entry_decoder is an example of a decoder working on a sum type / variant as you mention:

  1. in this example I use list_filter to only handle the variants executable and library and ignore the others
  2. there is an extra level of list wrapping (and List.flatten in the result), due, I think, to the inner workings of the Decoders library, which was designed with JSON rather than s-exprs in mind. I’m not sure, but I think it normalizes (polyline foo bar) into something like (polyline (foo bar)).
1 Like

Yes, gitlab.inria.fr is a bad place to host community-oriented free software. Unfortunately the admins are aware of it and they don’t want to change this (and don’t have the workforce resources to change it), and most users don’t think too deeply about the implications of their hosting choice, or are not aware of the problem. I think the best route is to ping the authors kindly (here @esope) to see if they would consider hosting their software on gitlab.com instead – or some other place.

(Hopefully those problems will magically solve themselves once we have proper federation between git forges…)

1 Like

Just curious, what didn’t work with Sexpq? I didn’t follow closely, but I don’t see what you couldn’t possibly express.

That’s one of the problems with s-expressions. There is no well-defined encoding of dictionaries; or rather, people who write them by hand do not want to use the clean ones.

In lisp you would write them as a list of bindings, a binding being (key . <s-exp>).

In config files it seems no one wants to write that. You could do (key <s-exp>), but it seems, again, no one wants to write the extra parens when binding a key to a list.

So we end up with this bastardized notion of binding, which is not so great since you can no longer distinguish a binding to a singleton list from a binding to an atom without external knowledge (it also makes substitution and other operations harder than they could be).
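To make the ambiguity concrete, here is a tiny sketch using a hand-rolled sexp type (a stand-in for Sexplib's); the same surface form can be read as a binding to an atom or to a singleton list, and both readings type-check:

```ocaml
type sexp = Atom of string | List of sexp list

(* "(key a)" in the parens-saving convention: *)
let binding = List [ Atom "key"; Atom "a" ]

(* Reading 1: the value is the atom after the key. *)
let value_as_atom = function
  | List [ Atom _; v ] -> Some v
  | _ -> None

(* Reading 2: the value is the list of everything after the key. *)
let value_as_list = function
  | List (Atom _ :: vs) -> Some (List vs)
  | _ -> None
```

Both readers accept the same input, which is exactly the "external knowledge" problem described above: only the schema tells you which one is intended.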

I didn’t try serialk because I understand that it’s not released / available on OPAM yet. The design looks nice, but I’m planning to include my sexp-extraction code in a PR in an upstream project and I want to stay with opam-released dependencies.

Unrelated: one thing I appreciate about Decoders (and I guess Serialk too) is that thought was given to error reporting. It’s not something that my quick&dirty hand-written code does, and I think that’s a large part of the value of using a specialized library.

Glad you found decoders useful.

Re the extra level of list wrapping: this is not intrinsic to Decoders itself, but it is intrinsic to the uncons combinator. uncons peels off the head of the list, but the tail is still a list.

To solve this I’d probably use uncons twice: once for the kind, as you already have, and again for the exec_decoder/lib_decoder.

You could define a let operator for uncons to make this look nicer. I’d add this to Decoders but I’m hesitant to add a whole barrage of cryptic operators.

Something like this:

let ( let*:: ) x f = uncons f x

let nil =
  list value >>= function
  | [] -> succeed ()
  | _ -> fail "expected an empty list"

let entry_decoder kind =
  match kind with
  | "executables" ->
      let+ v = exec_decoder in
      Some v
  | "library" ->
      let+ v = lib_decoder in
      Some v
  | _ ->
      succeed None

let entry_decoder =
  (* pop the first element off the list *)
  let*:: kind = string in
  (* now pop the second *)
  let*:: entry = entry_decoder kind in
  (* optional - assert we have nothing left to decode *)
  let+ () = nil in
  entry

let entries_decoder =
  list_filter entry_decoder |> map List.flatten

Note there is still a List.flatten. This is not due to Decoders, but due to the shape of the sexp and the shape of the desired result. In the source sexp the separate executables and library stanzas contain lists that we just want to concat together.

1 Like

Hi @jnavila : I think the combinators you are looking for are variant, field and repeat_full_list. I pushed your example as a test on the repository.

And I also agree that it is a pity that gitlab.inria.fr is not open by default to external contributors. I can ask to open an account for you, if this is what you want.

I see, thanks (technically there’s a version available through the b0 package, but don’t use it). Given the way you wrote your message, I thought you had hit some kind of expressiveness issue.

I agree with these points wholeheartedly.

PPX is great where the producer and consumer are both under your control, in an internal codebase. For everything else, I want the expressivity of writing parsing code.

Decoders tries to make this as easy as possible. Often, composing decoders is mechanical and mirrors the shape of your types. But where it doesn’t, it’s easy to adjust the decoder, and the adjustment is transparent to other people reading your code.

For example, handling versioned data is trivial with decoders. Just use one_of: try the latest version first, and fallback to the old version if it fails. Or, if you’re lucky enough to have a version field in your data, decode it, and switch on it to choose how to decode the rest.
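As a self-contained illustration of that fallback pattern: the toy 'a decoder type and one_of below only mimic the shape of Decoders' interface (its one_of takes a list of named decoders), and the "v1:"-prefixed wire format is entirely made up for the example:

```ocaml
(* A decoder is a fallible function from input to value. *)
type 'a decoder = string -> ('a, string) result

(* Try each named decoder in order; succeed on the first that works. *)
let one_of (ds : (string * 'a decoder) list) : 'a decoder =
 fun input ->
  let rec try_all = function
    | [] -> Error "all decoders failed"
    | (_name, d) :: rest -> (
        match d input with Ok v -> Ok v | Error _ -> try_all rest)
  in
  try_all ds

(* v2 of our imaginary format stores an int directly. *)
let v2 : int decoder =
 fun s ->
  match int_of_string_opt s with Some n -> Ok n | None -> Error "not v2"

(* v1 stored it with a "v1:" prefix. *)
let v1 : int decoder =
 fun s ->
  if String.length s > 3 && String.sub s 0 3 = "v1:" then
    v2 (String.sub s 3 (String.length s - 3))
  else Error "not v1"

(* Latest version first, with a fallback to the old one. *)
let version_tolerant : int decoder = one_of [ ("v2", v2); ("v1", v1) ]
```

The same shape works with the real library: list the current decoder first and older ones after it, and old data keeps decoding without any version field.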

In the current released version (0.7.0), Decoders tries to treat everything as being shaped like JSON. As such, it has to make a decision on how “objects” are represented in S-expressions (it follows Dune - see the note at the top of Decoders_sexplib.Decode · decoders-sexplib 0.7.0 · OCaml Packages).

In the next (unreleased) version, we are exposing a lower-level ('i, 'o) Decoder.t type (see decoder.mli). This is useful wherever you are decoding a type 'i into a type 'o with some possibility of error. I hope this will pave the way for Decoders interfaces to non-JSON-like formats, such as XML (see xml.ml).

1 Like

Another tip: since there is nothing in here specific to S-expressions, you could write it as a functor using the Decoders.Decode.S interface:

module Decode(D : Decoders.Decode.S) = struct
  open D

  ...
end

Then you can instantiate it with module Sexp_decode = Decode(Decoders_sexplib.Decode).

The benefit is that you can now decode JSON, CBOR, msgpck, YAML, etc. of the same shape for free.

Might not be all that useful for your use case, but we use this pattern a lot so we can instantiate our backend decoders with Decoders_yojson.Basic.Decode and our frontend decoders with Decoders_bs.Decode (for Bucklescript/Melange).

If you’re writing a library, it also leaves your users free to choose their favorite OCaml JSON library (Yojson, Jsonm, jsonaf, etc.).

3 Likes

Yes, this can be used to skip some fields, but the reason I’m not using it is that I find it more useful to explicitly capture and ignore all fields. This acts as a safety net when parsing responses from something that might change its schema (a bit like warning 9), and gives an opportunity to quickly change the type of the field when realizing later you need it.

Waa! Thanks a lot for the head start! Now, I have no excuses to procrastinate :laughing:

2 Likes

Hi, years ago I was not able to use Jane Street’s solution (I didn’t find any examples of use, I couldn’t figure out from the .mli how to read s-expressions the same way as XML, and no one answered when I asked about it on the forum/list), so I made my own solution:

It’s very small, and with no deps, you can easily include it into your project.

2 Likes