Combinator library for extracting data from s-exps?

@esope I think that your suggestion works, but (besides being convoluted) it forces all field decoders to have the same result type, which is a severe limitation in expressivity. I could inject them into a large sum, and then “gather” this into a product (your record combinator does this internally).

I hadn’t looked at peek closely, and indeed it looks like it could be a better solution – if you are willing to traverse the input several times, which is fine by me. Something like (not tested):

let unordered_field name decoder =
  list (first [some @@ field name decoder; none skip])
  |> map (List.filter_map Fun.id)
  |> peek
  |> map Option.join

Note: at this point I’ve got a 'a option decoder, I think, while I would prefer the option to be internal (in the failure monad). Maybe I would need to implement this internally, or to add a combinator 'a option decoder -> 'a decoder?
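
For what it’s worth, here is a minimal sketch of such a combinator, assuming a Decoders-style interface with let*, succeed and fail (the name required is made up):

let required (d : 'a option decoder) : 'a decoder =
  (* move the option into the decoder's own failure monad *)
  let* o = d in
  match o with
  | Some v -> succeed v
  | None -> fail "expected a value, but none was present"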

Note: in the short term I’m planning to give Decoders_sexp a try as suggested here, so I’m just passing by, please don’t spend too much time on this if I am the only prospective user.

Maybe one way to get a common data-type into the stdlib, without triggering the fear of just adding the “thing du jour”, could be to add a generic unbalanced tree type:

type 'a tree = Leaf of 'a | Node of 'a tree list

and then just have csexp and sexplib instantiate it with strings for their particular definitions of sexps.
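
For instance (a minimal sketch – the sexp alias and the example value are only illustrations):

type sexp = string tree

(* the s-expression (foo (bar baz)) *)
let example : sexp = Node [ Leaf "foo"; Node [ Leaf "bar"; Leaf "baz" ] ]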

[putting on my crusty curmudgeon contrarian hat]

I think we need to add support for YAML and JSON in the stdlib. These are critical wire-formats and datatypes all over the industry, and if OCaml is going to remain relevant, it needs to support them. I hear that TOML is coming on fast, and so we oughta support that too.

[insert pitch for N favorite different data-types and wire-formats used all over the industry]

Indeed, I think that s-expressions are the least important and urgent to add: only theoreticians and guys like me who grew up coding Scheme care about ’em. Everybody else has moved on … N times, and is planning for the N+1-st move to some new format.

If s-expressions are so widely-used, I don’t see the problem with putting them into some nearly-stdlib library and being done with it – everybody who depends on them can just depend on that library.

And I’m (almost, but not quite) serious about YAML and JSON, actually. They’re everywhere in the business. Everywhere.

Alright, I don’t want to pretend like this is a “solution” or something. Just that, maybe this is an indication of a way to go forward. And of course, what I’m really arguing is “@stedolan’s onto something with jq, my friends”.

So, here’s what I did:

  1. I modified Dune so that dune describe --format=json would put out JSON instead of sexps. It’s ugly and less-than-ideal, but hey, it’s also 30min of work. I’m sure someone who actually understands how Dune builds itself could do a nicer job.
  2. Then I used this new describe-output on dune itself, and also on yojson. I did a bit of grepping into the OCaml repos I have lying around in my github cache, and couldn’t find any with “module_deps”, but if someone can point me at one, I’d be happy to do … five minutes’ more hacking. Here are the files:
  3. And then I started trying to remember how to use jq, because it’d been a long while. But it was short work (b/c @stedolan ’s a wizard). I’m sure this can be done better, but this is what I have (in the file doit.jq):
# collect the "name" of every module in a "modules" list
def procmod: [.[] | .["name"]] ;

# stream the top-level list, keep only the "executables" stanzas,
# and rebuild a small {names, modules} object for each
.[]
| if .[0] == "executables" then . else empty end
| { "names": .[1]["names"], "modules": (.[1]["modules"] | procmod) }

What does it do?

  1. The output of dune describe --format=json is a JSON list of objects, each of which is a list, and the zeroth entry is either “executables” or “library”.

  2. First, the input is a list: pull each element out of the list, and stream them to the next step.

  3. Keep only the values whose zeroth element is “executables”.

  4. Then the first element is going to be an assoc-list (dictionary).

  5. Pull out the “names” field and the “modules” field.

  6. Pass the value of the “modules” field to “procmod”, which right now just extracts all the names of the modules.

  7. Rebuild a little dictionary holding the “names” and “modules” we pulled out above, for each executable.

If somebody can point me at a dune project that actually has module_deps fields in it, I can make it do the rest that @gasche asked for.

So the final result (for yojson – for dune it’s much bigger, b/c so many executables):

{
  "names": [
    "ydump"
  ],
  "modules": [
    "Ydump"
  ]
}
{
  "names": [
    "filtering"
  ],
  "modules": [
    "Filtering"
  ]
}
{
  "names": [
    "constructing"
  ],
  "modules": [
    "Constructing"
  ]
}
{
  "names": [
    "test",
    "atd"
  ],
  "modules": [
    "Test",
    "Atd"
  ]
}

OK. So that was kind of complicated to explain, b/c I didn’t explain the JQ streaming model, or filters, or anything else. But … just look at how compact the query is, and … trust me that it’s actually very readable and … has the nice properties we’d want of a query-language.

Slightly OT, but coming back to the original problem: the following assumption seems needlessly strong:

Because dune uses sexps for serialisation, and dune itself is written in OCaml (and is thus serialising OCaml values), there’s a good chance that the s-expression should fit into an OCaml type, no?

Something like the following should work, no?

type module_spec = {
  name: string;
  impl: string option;
  intf: string option;
  cmt: string option;
  cmti: string option;
} [@@deriving sexp, show]

type executable = {
  names: string list;
  requires: string list;
  modules: module_spec list;
}  [@@sexp.allow_extra_fields] [@@deriving sexp, show]

type library = {
  name: string;
  uid: string;
  local: bool;
  requires: string list;
  source_dir: string;
  include_dirs: string list;
  modules: module_spec list;
}  [@@sexp.allow_extra_fields] [@@deriving sexp, show]
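
With these definitions, [@@deriving sexp] generates converters such as executable_of_sexp, so decoding a stanza could look like this (a minimal sketch; parse_executable is a made-up name):

let parse_executable (s : string) : executable =
  (* [executable_of_sexp] is generated by [@@deriving sexp] above *)
  executable_of_sexp (Sexplib.Sexp.of_string s)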

Let me be a bit more explicit about why I don’t want to write types: I want to write parsing code directly.

  1. I’m looking for a general solution that works for any s-expression used for interoperability by some tool, not just an OCaml-implemented tool. The reason for my question in the first place is not to help me solve one specific problem involving Dune, but rather to find out whether the existing ecosystem solves this need that I think is important. (It’s important for any data format used for interoperability in practice, indeed including JSON and YAML and whatever else.)

  2. Using meta-programming in this way is inherently less expressive than writing parsing code: it works well as long as you stay within what the metaprogramming tool knows how to express, and you fall off a cliff very quickly when you step outside this fragment. Granted, ppx_sexp_conv is surprisingly expressive (I didn’t know about allow_extra_fields, and apparently @emillon didn’t either), but I’m sure that there is a cliff waiting somewhere and I don’t want an approach that leaves you with “open an issue against the metaprogramming library and wait for it to be solved” once your needs evolve.

  3. Using meta-programming in this style is also non-obvious to beginners, and less discoverable. I hope that the ecosystem has a solution for a standard need that people know to look for.

  4. Specifically on @nojb’s suggestion of reusing Dune’s type definitions directly, again it is not a general solution to the problem of interoperability, possibly with non-OCaml code. They also introduce tight coupling with Dune. If I wanted to do this, why not directly call Dune as a library and work with OCaml values all the way?

I would like to convince you to look at this differently and far more aggressively. I think what you should be looking for is a way to treat sexp/JSON/YAML/whatever as a queryable database. That means you want a query-language that is fully capable of doing not only (hierarchical) projection, but also transformation and summarization, in the style that SQL provides for relational data.

And you want that query-language to be succinct and amenable to scripting-like accretive programming. Again, this is what you get when you use SQL against relational DBs.

Literally, once upon a time an esteemed professor from grad school called me up and said “Chet, I have this JSON data and I need to extract thus-and-such, summarize thus-and-such, can you help?” and I wrote him a JQ script in, like, ten minutes. And that time also, I had to remember JQ in order to write the script.

As an aside, SQL databases are really starting to grow this kind of feature for JSON too! Postgres has had it for a while, and sqlite recently added shorter forms for its JSON extension. So you can mix JSON and other relational data in a sqlite file and use the query language!

I think we need to add support for YAML and JSON in the stdlib. These are critical wire-formats and datatypes all over the industry, and if OCaml is going to remain relevant, it needs to support them. I hear that TOML is coming on fast, and so we oughta support that too.

This, but entirely unironically. YAML is more of a config language, but these days you can’t lift a stone without finding a handful of JSON files beneath it. Not having JSON support is like not having UTF-8 support (… which we only just added recently?).

When you write compilers like it’s 2000, it’s fine not to have JSON. But as soon as you touch anything networked, HTTP and JSON are everywhere (and also hex, base64, etc., but I digress). It’s a pity not to have them. Even for compiler-style programs, provers, etc., these days it’s hard not to want to provide JSON diagnostics or metadata – e.g. META files designed today would very likely use that format.

Multiple thoughts:

  1. Have you heard of “jsonnet”? It’s a weird functional language that computes over JSON. Erm, that is, its origins are as a language that computes over protocol buffers, but since protobufs are really JSON … Places like Databricks use it to compute the config-files for cloud deployments and such.

  2. That is to say, computing over JSON is a big, big, big problem, and “well, write a bunch of code in your favorite programming language” isn’t an answer (as you rightly note) b/c boy howdy, that’ll take forever, and be about as maintainable as “write a bunch of code that accesses btree indexes directly” was at the dawn of relational databases.

We need query and computation languages over JSON for the same reason we needed relational query languages.

  3. YAML ought to be[1] just another syntax for JSON. Most people who use YAML use it as precisely that. And boy howdy, everything you need for JSON, you also need for YAML. B/c the size of these YAML files that are generated by auto-configurators, and that you then have to modify by hand … geeeeeez.

[1] YAML has these weird bits of syntax that most people don’t use, b/c they recognize that different YAML parsers accept different subsets of the language. It’s all a big mess. I came up with my own subset, designed so that one could write a parser in any language that would accept that language on-the-nose, but hey, not like I can convince anybody to use it: GitHub - chetmurthy/yay: YAY Ain't YAML

FYI, for JSON, Tezt has some combinators that I find very nice to use in practice: tezt/lib/JSON.mli · master · Tezos / tezos · GitLab

You write stuff like JSON.(json |-> "people" |=> index |-> "name" |> as_string).

One nice feature is that error messages give nice locations. For instance, assuming json has an origin of "file.json" (the origin is set at parse time), you can get errors like:

file.json, at people: not an array
file.json, at people.[42]: not an object
file.json, at people.[42]: missing field: name
file.json, at people.[42].name: not a string

A typical example to decode a record would be:

let decode_vector json = {
  x = JSON.(json |-> "x" |> as_float);
  y = JSON.(json |-> "y" |> as_float);
}

One drawback is that this approach is quadratic in the number of record fields, though, since each field access scans the whole object. The alternative is to iterate over the fields one by one, setting mutable values along the way, before packing all those mutable values into a record, but that’s quite annoying to write (see the sketch below).
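
To make “annoying” concrete, here is a rough sketch of that single-pass style; the as_object accessor (returning the object’s field list) is an assumption, not necessarily Tezt’s actual API, and the vector record is the one from decode_vector above:

(* Single-pass decoding: one traversal of the object, filling mutable
   slots, then packing the slots into the record at the end. *)
let decode_vector_single_pass json =
  let x = ref None and y = ref None in
  List.iter
    (fun (key, value) ->
      match key with
      | "x" -> x := Some (JSON.as_float value)
      | "y" -> y := Some (JSON.as_float value)
      | _ -> ())
    (JSON.as_object json);
  match (!x, !y) with
  | Some x, Some y -> { x; y }
  | _ -> failwith "missing field x or y"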

I’m sure something like this can be done for s-expressions.
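
For instance, here is a very rough sketch of what such accessors could look like over sexplib-shaped s-expressions. The |-> operator, its exact semantics, and the error handling are all made up, and a real version would track locations for error messages the way Tezt’s JSON module does:

type sexp = Atom of string | List of sexp list

(* [s |-> key] finds the (key v1 v2 ...) binding among the children of
   [s] and returns its values, still wrapped in a list *)
let ( |-> ) sexp key =
  match sexp with
  | List children -> (
      match
        List.find_opt
          (function List (Atom k :: _) -> k = key | _ -> false)
          children
      with
      | Some (List (_ :: values)) -> List values
      | _ -> failwith ("missing field: " ^ key))
  | Atom _ -> failwith ("not a list, when looking up: " ^ key)

(* unwrap a single atom, e.g. the value of (x 1.0) *)
let as_string = function
  | Atom s | List [ Atom s ] -> s
  | _ -> failwith "not an atom"

(* example: on (vector (x 1.0) (y 2.0)),
   [sexp |-> "x" |> as_string] returns "1.0" *)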

This is one of the reasons you want a query language and not a set of combinators.

I fail to see how a query language would help to convert JSON values to actual OCaml records, do you have an example? Unless the query language comes with a PPX or something like that, in which case the problem would be solved by the PPX itself, not the query language, right?

I was responding to @gasche ’s original problem. There, the issue is to construct some cut-down data-structure from the original full sexp/JSON. In an earlier comment in this thread, I mentioned that I’d written an OCaml implementation of @stedolan ’s jq; so did someone else (Query-json: Re-implemented jq in Reason Native/OCaml); two thoughts:

  1. this would allow the query-engine to produce an OCaml JSON value
  2. at least when I wrote my interpreter, it was straightforward to imagine how to produce instead a code-generator, which could easily be converted into a PPX.

Now that I think about it, it seems … obvious that we could repurpose jq to solve your problem pretty much on-the-nose. Imagine:

  1. s-expressions of the form ((a b) (c d)...) are treated as JSON dicts.
  2. otherwise, s-expressions of the form (e1 e2 ...) are treated as JSON lists
  3. other cons nodes are errors. Or maybe we invent syntax to do car/cadr
  4. and everything else maps to strings.

This is a simple transformation of s-expressions to JSON, and you could construct the reverse, so that sexp -> json -> sexp is the identity function (ignoring case 3).
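
A minimal sketch of that forward transformation, assuming a csexp-style sexp type and a Yojson-style JSON type (both written out here rather than taken from a library); case 3 cannot arise with this sexp type, so it is omitted:

type sexp = Atom of string | List of sexp list

type json =
  [ `Assoc of (string * json) list
  | `List of json list
  | `String of string ]

let rec json_of_sexp : sexp -> json = function
  | Atom s -> `String s (* rule 4: atoms become strings *)
  | List items ->
      let as_binding = function
        | List [ Atom key; value ] -> Some (key, json_of_sexp value)
        | _ -> None
      in
      let bindings = List.map as_binding items in
      if items <> [] && List.for_all Option.is_some bindings then
        (* rule 1: ((a b) (c d) ...) becomes a JSON dict *)
        `Assoc (List.filter_map Fun.id bindings)
      else
        (* rule 2: any other (e1 e2 ...) becomes a JSON list *)
        `List (List.map json_of_sexp items)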

Then, one could just reuse JQ to do the querying.

It seems like that’d solve your problem? And since there are two different JQ implementations in OCaml …

This library might be what I was looking for to parse the kicad schematics file format (for instance here).

However (sorry to piggyback on this thread), I don’t see how to parse a list of values of a sum type, such as this description of a symbol:

(symbol "Conn_Coaxial_0_1"
        (arc (start -1.778 0.508) (end 1.778 0) (radius (at -0.0508 0) (length 1.8034) (angles 163.6 0))
          (stroke (width 0.254)) (fill (type none))
        )
        (arc (start 1.778 0) (end -1.778 -0.508) (radius (at -0.0254 0) (length 1.8034) (angles 0 -163.8))
          (stroke (width 0.254)) (fill (type none))
        )
        (circle (center 0 0) (radius 0.508) (stroke (width 0.2032)) (fill (type none)))
        (polyline
          (pts
            (xy -2.54 0)
            (xy -0.508 0)
          )
          (stroke (width 0)) (fill (type none))
        )
        (polyline
          (pts
            (xy 0 -2.54)
            (xy 0 -1.778)
          )
          (stroke (width 0)) (fill (type none))
        )
      )

The gitlab instance is not open to external users, so its issue system is quite useless.

Thanks.

I just wrote a parser for my input data using Decoders_sexplib, and the result works. It’s the only library recommended in the thread that solves my specific problem, so far.

  open Decoders_sexplib.Decode

  let module_deps_decoder =
    let+ for_intf = field "for_intf" (list string)
    and+ for_impl = field "for_impl" (list string)
    in for_intf @ for_impl

  let module_decoder entry_name =
    let+ name = field "name" string
    and+ impl = field "impl" (list string) |> map List.hd
    and+ deps = field "module_deps" module_deps_decoder
    in (entry_name, name, impl, deps)

  let exec_decoder =
    let* entry_name = field "names" (list string) |> map List.hd in
    field "modules" (list (module_decoder entry_name))

  let lib_decoder =
    let* entry_name = field "name" string in
    field "modules" (list (module_decoder entry_name))
    
  let entry_decoder =
    list_filter (
      string |> uncons @@ fun kind ->
      match kind with
      | "executables" -> let+ v = list exec_decoder in Some v
      | "library" -> let+ v = list lib_decoder in Some v
      | _ -> succeed None
    )
    |> map List.flatten
    |> map List.flatten

Note that entry_decoder is an example of a decoder working on a sum type / variant as you mention:

  1. in this example I use list_filter to handle only the “executables” and “library” variants and to ignore the others
  2. there is an extra level of list wrapping (and List.flatten in the result), due, I think, to the inner workings of the Decoders library, which was designed with JSON rather than s-exprs in mind. I’m not sure, but I think that it normalizes (polyline foo bar) into something like (polyline (foo bar)).
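
For completeness, here is a rough sketch of a driver for this decoder; decode_string and string_of_error are what I believe Decoders exposes, so double-check the exact entry points:

  (* read "dune describe" output from stdin and print the decoded entries *)
  let () =
    let input = In_channel.input_all In_channel.stdin in
    match Decoders_sexplib.Decode.decode_string entry_decoder input with
    | Ok modules ->
        List.iter
          (fun (entry, name, impl, deps) ->
            Printf.printf "%s: %s (%s) depends on %s\n" entry name impl
              (String.concat ", " deps))
          modules
    | Error e ->
        prerr_endline (Decoders_sexplib.Decode.string_of_error e)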

Yes, gitlab.inria.fr is a bad place to host community-oriented free software. Unfortunately the admins are aware of it and they don’t want to change this (and don’t have the workforce resources to change it), and most users don’t think too deeply about the implications of their hosting choice, or are not aware of the problem. I think the best route is to ping the authors kindly (here @esope) to see if they would consider hosting their software on gitlab.com instead – or some other place.

(Hopefully those problems will magically solve themselves once we have proper federation between git forges…)

Just curious: what didn’t work with Sexpq? I didn’t follow closely, but I don’t see what you couldn’t express.

That’s one of the problems with s-expressions. There is no well-defined encoding of dictionaries; or rather, people who write them by hand do not want to use the clean ones.

In lisp you would write them as a list of bindings, a binding being (key . <s-exp>).

In config files, it seems no one wants to write that… You could do (key <s-exp>), but again it seems no one wants to write the extra parens when binding a key to a list.

So we end up with this bastardized notion of binding, which is not so great, since without external knowledge you can no longer distinguish a binding to a singleton list from a binding to an atom: given (key a), is key bound to the atom a or to the list (a)? It also makes substitution and other operations harder than they could be.

I didn’t try serialk because I understand that it’s not released / available on OPAM yet. The design looks nice, but I’m planning to include my sexp-extraction code in a PR in an upstream project and I want to stay with opam-released dependencies.

Unrelated: one thing I appreciate about Decoders (and I guess Serialk too) is that thought was given to error reporting. It’s not something that my quick&dirty hand-written code does, and I think that’s a large part of the value of using a specialized library.