Combinator library for extracting data from s-exps?

Context

I want to write a simple script to consume some s-expressions output by dune and do something with them.

Input looks like this:

((executables
  ((names (main))
   (requires
    (2e58431e757317d4157ed69cdbce2cb0 c480a7c584d174c22d86dbdb79515d7d))
   (modules
    (((name Main)
      (impl (_build/default/src/bin/main.ml))
      (intf ())
      (cmt (_build/default/src/bin/.main.eobjs/byte/dune__exe__Main.cmt))
      (cmti ())
      (module_deps ((for_intf ()) (for_impl ()))))))
   (include_dirs (_build/default/src/bin/.main.eobjs/byte))))
  ...)

And I’d like to get the “name” and list of modules and their dependencies for each “executables” of the toplevel list.

Question

Are there opam packages that offer small combinator libraries to write such code? I looked at the Sexplib documentation and I’m not under the impression that there is something there. (I think that it’s a rather basic question and it’s weird that the documentation does not answer it, but oh well.)

What I found

Within Sexplib, there is:

  • a Conv module that provides some foo_of_sexp conversions that could conceivably be used for this, but are more meant for one-to-one mapping to OCaml types. (Note: here I consider that the s-expression I get is arbitrary, and may not fit an OCaml type.)
  • a Grammar module that looks like a much-more-engineered version of what I would naively look for, and probably overkill for what I’m looking for. (Also, an example or two in the documentation would help, but oh well.)
  • a Path module that looks like it may contain nice ways to do selector-based data lookup in s-expressions but is in fact aimed at something else (substitutions)

Home-grown example

I spent an hour writing something specific yesterday, writing combinators at the same time, and it looks like this. I’m not saying this is good (I’m sure it will look different next time I write it again), but it gives a rough idea of the sort of combinators I would have expected to find a library for.

let read_entries entries =
  let entries = list entries in
  let* (item_name, item) =
    begin
      let+ exec = select_all ["executables"] entries in
      (select_single ["names"] exec |> list_of atom |> single,
       exec)
    end @ begin
      let+ lib = select_all ["library"] entries in
      (select_single ["name"] lib |> atom,
       lib)
    end
  in
  let+ module_ = select_single ["modules"] item |> list_of list in
  let name = select_single ["name"] module_ |> atom in
  let impl = select_single ["impl"] module_ |> list |> single |> atom in
  let deps =
    let deps = select_single ["module_deps"] module_ |> list in
    (select_single ["for_intf"] deps |> list_of atom)
    @
    (select_single ["for_impl"] deps |> list_of atom)
  in
  (item_name, name, impl, deps)
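
The combinators used above are not defined in the post. As a rough, self-contained sketch of what they could look like (the signatures below are assumptions chosen for illustration, not the author's actual code, and the local sexp type is only a stand-in for Sexplib's), the list monad plus a few failing projections is enough:

```ocaml
(* Stand-in s-expression type, structurally the same as Sexplib's. *)
type sexp = Atom of string | List of sexp list

exception Shape_error of string

(* List-monad binding operators: [let*] is concat_map, [let+] is map. *)
let ( let* ) xs f = List.concat_map f xs
let ( let+ ) xs f = List.map f xs

(* Projections that raise when the shape is not the expected one. *)
let atom = function Atom s -> s | List _ -> raise (Shape_error "expected an atom")
let list = function List l -> l | Atom _ -> raise (Shape_error "expected a list")
let list_of f sexp = List.map f (list sexp)
let single = function [x] -> x | _ -> raise (Shape_error "expected exactly one element")

(* Payloads [v] of every [(key v)] entry in a list of fields. *)
let payloads key fields =
  List.filter_map
    (function List [Atom k; v] when String.equal k key -> Some v | _ -> None)
    fields

(* Follow a path of keys through nested key-value lists (hypothetical semantics). *)
let rec select_all path fields =
  match path with
  | [] -> invalid_arg "select_all: empty path"
  | [key] -> payloads key fields
  | key :: rest ->
    List.concat_map (fun v -> select_all rest (list v)) (payloads key fields)

let select_single path fields = single (select_all path fields)

(* One row per module of each "executables" entry. *)
let read_executables entries =
  let* exec = select_all ["executables"] (list entries) in
  let exec = list exec in
  let exe_name = select_single ["names"] exec |> list_of atom |> single in
  let+ module_ = select_single ["modules"] exec |> list_of list in
  let name = select_single ["name"] module_ |> atom in
  let impl = select_single ["impl"] module_ |> list |> single |> atom in
  let deps =
    let deps = select_single ["module_deps"] module_ |> list in
    (select_single ["for_intf"] deps |> list_of atom)
    @ (select_single ["for_impl"] deps |> list_of atom)
  in
  (exe_name, name, impl, deps)

(* Trimmed-down version of the input shown earlier, with one made-up dependency. *)
let sample =
  List [ List [ Atom "executables";
    List [ List [Atom "names"; List [Atom "main"]];
           List [Atom "modules"; List [ List [
             List [Atom "name"; Atom "Main"];
             List [Atom "impl"; List [Atom "src/bin/main.ml"]];
             List [Atom "intf"; List []];
             List [Atom "module_deps"; List [
               List [Atom "for_intf"; List []];
               List [Atom "for_impl"; List [Atom "Util"]]]]]]]]]]
```

In this sketch a field lookup never consumes anything, so fields can be read in any order and unknown fields are simply ignored; the price is that errors are exceptions with no location information.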

JSON

This looks like a problem common to other data forms, for example Json data. Curiously enough, I don’t remember looking for such a library to write with Json data in the past: I guess the fact that it is just slightly more structured (with explicit representation of lists and dictionaries) makes it easy to just use List.assoc directly.

Two things come to mind (haven’t used any myself):

Cheers,
Nicolas

This is somewhat orthogonal to your question, but note that the s-exps that Dune produces typically do fit OCaml types. For example, dune/bin/describe.ml at bc521522dcab27b2963a6445454d386fbe81fbef · ocaml/dune · GitHub

Cheers,
Nicolas

Thanks for the pointers! Sexpq from Sexp_serialk looks like what I was looking for.

Since you linked to the version that is vendored in b0 I’ll just provide the link to the library where these combinators are eventually to be published. The docs may be more up-to-date.

This library also has support for JSON and for other formats, planned or in progress, with a similar API. The JSON API is not yet on par with the sexp one (no layout-preserving updates), but I do use it fairly often to interact with web services in my projects and I find it quite convenient.

The only thing I’m a bit unhappy about is that queries can’t be used for generation. I’d like to explore whether that can be fixed, but shoving it into the same API might thwart the current API ergonomics, which are quite good in my opinion. Also, I was surprised to find out I could use the query API for updates as well, but @let-def complained that you could only perform a single update with it, a comment that now haunts me :ghost:.

The whole thing needs a new design round and to be adapted to the stdlib UTF codecs but serialk is one of the things I hope to be able to bring to a first release this year (famous last words).

@gasche : you might be interested in the sexp_decode library. It is available on OPAM.

@esope I gave Sexp_decode a try for my use-case (it has the advantage of being released), and I hit two issues.

Edit: I filed the two comments in the issue tracker (#1, #2) in case it is more pleasant to keep track of them there.

Sexp type (minor)

I am using the Sexplib parsers (Sexp.load_sexp), which return a value of type Sexplib.Type.t, which is not the same as Csexp.t: they are two identical but incompatible type declarations. I tried to find a parsing function in Csexp, but my understanding is that the library only deals with the binary encoding of s-expressions, not their common textual format, so I ended up writing a conversion function by hand.

Note that Csexp itself does not suffer from this issue as it offers a functor parametrized over the sexplib type, so I could call Csexp.Make(Sexplib.Type) and get compatible types. Maybe Sexp_decode could offer a functorized interface in the same way? (Or maybe Csexp and Sexplib could agree to share their type definition with a common dependency.)
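
To make the two options concrete, here is a minimal sketch. The modules A and B are local stand-ins for two libraries that each declare their own copy of the same type (they are not the real Sexplib.Type or Csexp modules), and Make_query is a hypothetical functor showing the shape a functorized interface could take:

```ocaml
(* Two identical but incompatible declarations, as with Sexplib.Type.t and Csexp.t. *)
module A = struct type t = Atom of string | List of t list end
module B = struct type t = Atom of string | List of t list end

(* Option 1: a hand-written conversion, one constructor at a time. *)
let rec b_of_a : A.t -> B.t = function
  | A.Atom s -> B.Atom s
  | A.List l -> B.List (List.map b_of_a l)

(* Option 2: code functorized over any module exposing that exact type. *)
module type SEXP = sig
  type t = Atom of string | List of t list
end

module Make_query (S : SEXP) = struct
  (* Collect every atom, left to right. *)
  let rec atoms : S.t -> string list = function
    | S.Atom s -> [s]
    | S.List l -> List.concat_map atoms l
end

(* The same code now works on either type without conversion. *)
module Query_a = Make_query (A)
module Query_b = Make_query (B)
```

The conversion costs a full traversal and copy of the value, while the functor costs nothing at runtime but forces every user of the library through a functor application.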

Partial decoding

I’m trying to parse only some parts of the data, not decode all of it into an isomorphic representation. In particular, I have a key-value s-expression and I want to get the fields foo and bar, in any order, ignoring the other fields. I did not find a way to do this using the current Sexp_decode API, which seems designed to fetch all fields at once. I wrote the following:

  let module_decoder entry_name =
    let+ name = field "name" atom
    and+ impl = field "impl" (list1 atom) |> map List.hd
    and+ deps = field "module_deps" deps_decoder
    in (name, impl, deps)

and I was expecting to run against data as in my example:

     ((name Main)
      (impl (_build/default/src/bin/main.ml))
      (intf ())
      (cmt (_build/default/src/bin/.main.eobjs/byte/dune__exe__Main.cmt))
      (cmti ())
      (module_deps ((for_intf ()) (for_impl ()))))

but, re-reading the documentation, my usage of field can only work if there is no data to ignore in-between, and if I list the fields exactly in order. I don’t want to do that. I could use the record combinator but then I would have to list all possible fields I think, and I don’t want to / cannot do that. (I’m only observing one run of the data producer, there may be other fields that will show up with other options or as the producer changes.)

If I understand correctly, the underlying mental model of Sexp_decode is to consume input in the order of the decoders in the code. For partial decoding, I’m rather looking for the ability to call several partial decoders on a given input, without consuming input. The only combinators that allow this in the current API are meant for backtracking: you can call several decoders on the same input (first, or_else), but only in the case where one decoder fails.
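
For contrast, here is a minimal sketch of the non-consuming model described above. It is not the Sexp_decode API: the decoder type and every combinator below are assumptions made for illustration. A decoder is just a function from a value to a result, so any number of decoders can be run on the same input:

```ocaml
type sexp = Atom of string | List of sexp list

(* A decoder inspects a value and may fail; it never consumes input. *)
type 'a decoder = sexp -> ('a, string) result

let atom : string decoder = function
  | Atom s -> Ok s
  | List _ -> Error "expected an atom"

let list_of (d : 'a decoder) : 'a list decoder = function
  | List l ->
    List.fold_right
      (fun x acc ->
        match d x, acc with
        | Ok v, Ok vs -> Ok (v :: vs)
        | Error e, _ | _, Error e -> Error e)
      l (Ok [])
  | Atom _ -> Error "expected a list"

let list1 (d : 'a decoder) : 'a list decoder = fun s ->
  match list_of d s with
  | Ok [] -> Error "expected a non-empty list"
  | r -> r

(* Look up the first [(key value)] pair; all other fields are ignored. *)
let field key (d : 'a decoder) : 'a decoder = function
  | List fields ->
    let is_key = function List [Atom k; _] -> String.equal k key | _ -> false in
    (match List.find_opt is_key fields with
     | Some (List [_; v]) -> d v
     | _ -> Error ("missing field " ^ key))
  | Atom _ -> Error "expected a key-value list"

let map f (d : 'a decoder) : 'b decoder = fun s -> Result.map f (d s)

(* Applicative operators: both decoders see the same, whole input. *)
let ( let+ ) d f = map f d
let ( and+ ) (d1 : 'a decoder) (d2 : 'b decoder) : ('a * 'b) decoder = fun s ->
  match d1 s, d2 s with
  | Ok a, Ok b -> Ok (a, b)
  | Error e, _ | _, Error e -> Error e

let module_decoder =
  let+ name = field "name" atom
  and+ impl = field "impl" (list1 atom) |> map List.hd
  and+ deps = field "module_deps" (field "for_impl" (list_of atom)) in
  (name, impl, deps)

(* Fields deliberately out of order, with extra ones to ignore. *)
let sample =
  List [ List [Atom "cmti"; List []];
         List [Atom "impl"; List [Atom "src/bin/main.ml"]];
         List [Atom "module_deps";
               List [ List [Atom "for_intf"; List []];
                      List [Atom "for_impl"; List []]]];
         List [Atom "name"; Atom "Main"] ]
```

The trade-off is that each field lookup rescans the list, and nothing checks that the input contains no unexpected fields, which is exactly what a total decoder is designed to guarantee.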

I’ve done that in the past with ppx_sexp_conv. You describe the shape of the data and directly use the generated functions. The main trick is that you don’t have to map everything precisely: if you declare a field as sexp, it will parse as the sexp itself. And you can use an any type to put a unit hole instead of the sexp if you don’t need it.

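(* Presumably assumes Base or Core is opened and the ppx_jane preprocessors
   are enabled (for Sexp, raise_s, [%message] and [@@deriving sexp]). *)
type any = unit [@@deriving sexp_of]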

let any_of_sexp _ = ()

type executables_stanza = {
  names : string list;
  requires : string list;
  modules : any;
  include_dirs : any;
}
[@@deriving sexp]

type library_stanza = {
  name : string;
  uid : string;
  local : bool;
  requires : string list;
  source_dir : any;
  modules : any;
  include_dirs : any;
}
[@@deriving sexp]

type stanza = Executables of executables_stanza | Library of library_stanza
[@@deriving sexp_of]

let stanza_of_sexp = function
  | Sexp.List [ Atom "executables"; s ] ->
      Executables (executables_stanza_of_sexp s)
  | Sexp.List [ Atom "library"; s ] -> Library (library_stanza_of_sexp s)
  | Sexp.List [ Atom atom; _ ] -> raise_s [%message "stanza_of_sexp" atom]
  | sexp -> raise_s [%message "stanza_of_sexp" (sexp : Sexp.t)]

type t = stanza list [@@deriving sexp]

decoders has a sexplib adapter. I don’t think it mandates using all the fields.

side note: I’m really vexed that, years later, we still don’t have a standard sexp type in the stdlib. I’m pretty sure I suggested that years ago, got shut down (can’t find the discussion) because “sexplib0 is there”; and now of course csexp doesn’t even use sexplib0. Grr.

This is a bit OT but I’m pretty sure it was shot down for other reasons. People always want something, either sexp or json or their favourite thing du jour (15 years ago it would have been XML). It’s a good idea not to have this in the stdlib.

All of these representations are largely broken anyways for a language with rich types like OCaml. s-expressions are not better either: there is no established standard definition and there are at least three different ways of encoding dictionaries in them.

I feel like a broken record, but as far as the stdlib goes, what you are looking for is a good runtime type representation.

There is absolutely one representation for sexprs that several libraries use:

type t = Atom of string | List of t list

and I’m not saying the stdlib should pack a parser. Only the type (just like Uchar for a while). Your argument of “thing du jour” holds for everything, we also shouldn’t have Map because it’s AVL based, so, passé, when these days we should use HAMTs or RRB trees or whatnot. The stdlib can contain a few compatibility types and not implode.

And that representation is bad if you want to be able to report good error locations…

If you really want your simplistic compatibility type why don’t you simply define it as:

type t = [`Atom of string | `List of t list]

and go convince other libraries to use that? No need to add anything to the stdlib and you can do it now.

I’m afraid I don’t see the connection. The datatype is abstract.

But why should it contain useless cruft ? Your representation is not a very good one and s-expressions are not that great in practice.

Thank you for the feedback. It is true that sexp_decode was designed for total decoding, as opposed to partial decoding, and thus might not fit your needs.
I will have a look at the tickets you created as soon as I find time.

For the record, I used the csexp type, since this is the recommended way to analyze the output of dune describe (rather than analyzing an sexp).

I’m wondering whether you can achieve the partial decoding of your example using a combination of first, drop and list…

the Map.S.t type is abstract but it’s condemned to be a balanced tree since it takes compare, not equal + hash. But yes, this is a digression.

I personally do define S-exprs with the poly variant you listed; but sexplib has a lot of mindshare and it does not. We’re stuck with fragmentation there as in many other places.

I know this is not responsive to your original question, but still: is it possible for dune to export that data as JSON? If so, then … the jq tool (from @stedolan ) is … yummy. I mean Red Velvet Cake yummy. It’d allow you to do the partial-data-extraction you desire.

My interest in pulling this thread is that I think this is a fairly basic need, which should have a simple answer within the OCaml ecosystem. I could certainly use an external tool, or another language, write my own throwaway code, or use Dune’s internal types, but none of those are general solutions that benefit the ecosystem.

If this were JSON, I could respond with “I wrote an OCaml clone of stedolan’s JQ that has most of the functionality, mostly for the fun of it”. I suspect that with only a little work, one could define an “sexp-query” language like “jq”'s query language (it’s been a while, so I forget whether he got it from someplace else, or invented it himself) and then write a query engine for that.
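
As a very rough illustration of how small the core of such an engine could be, here is a sketch of a path evaluator over s-expressions. The step type and the query function are invented for this example (they do not come from jq or from any existing library), and a real query language would also need a concrete syntax, filters and error reporting:

```ocaml
type sexp = Atom of string | List of sexp list

(* A query is a list of steps, loosely modelled on jq's [.key] and [.[]]. *)
type step =
  | Key of string  (* payloads of every [(key payload)] entry of a list *)
  | Each           (* every element of a list *)

let apply step s =
  match step, s with
  | Key k, List fields ->
    List.filter_map
      (function List [Atom k'; v] when String.equal k k' -> Some v | _ -> None)
      fields
  | Each, List items -> items
  | _, Atom _ -> []  (* a step applied to an atom selects nothing *)

(* Run the steps left to right, keeping every match at each stage. *)
let query path s =
  List.fold_left (fun acc step -> List.concat_map (apply step) acc) [s] path

(* Trimmed-down version of the dune output from the first post. *)
let sample =
  List [ List [ Atom "executables";
    List [ List [Atom "names"; List [Atom "main"]];
           List [Atom "modules"; List [ List [
             List [Atom "name"; Atom "Main"];
             List [Atom "intf"; List []]]]]]]]

(* Roughly ".executables.modules[].name" in jq-like notation. *)
let module_names =
  query [Key "executables"; Key "modules"; Each; Key "name"] sample
```

Because every step maps a set of matches to a set of matches, missing keys simply yield an empty result instead of an error, which is convenient for exploration but hides typos in key names.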

Would that help? I mean, yes, I agree that having query-languages for significant config-file/data formats is valuable. I’d argue that that format should be JSON, but only b/c of its massive prevalence. A priori, I have no objection to sexp.

+1 for decoders, I haven’t used it for sexps, but had great success in using it to partially decode JSON objects that loosely follow a fairly open spec:

let announce obj =
  let open D in
  let* () = field "type" @@ constant ~msg:"expected Announce object (received %s)" "Announce"
  and* actor = field "actor" id
  and* id = field "id" string
  and* published = field_opt "published" timestamp
  and* to_ = field "to" (singleton_or_list string)
  and* cc = field_or_default "cc" (singleton_or_list string) []
  and* obj = field "object" obj
  and* raw = value in
  succeed ({id; published; actor; to_; cc; obj; raw}: _ Types.announce)

I generally prefer using explicit encoders/decoders to using ppxs - especially when you need to do partial decoding, because trying to encode the exact semantics that I want via ppx often leads to quite convoluted/unnatural type definitions.

edit: ppx based approaches are also less than ideal when you want slightly different semantics for encoding and decoding.

@gasche : though I agree that a combinator for your need should be made directly available in sexp_decode, it seems to me that the following decoder solves your needs. Doesn’t it?

let some d = d >>| fun x -> Some x
let none d = d >>> return None
let d = list (first [some @@ field "name" atom;
                     some @@ field "impl" (list1 atom) >>| Option.map List.hd;
                     none skip])
        >>| List.filter_map Fun.id

For your second request about a decoder that does not consume its input, have you looked at the peek combinator?

I created upstream issues about this:
