"locations" in s-expression/json for generated deserializers (a trip report)

[ETA: the thrust of this little trip report is: when you’re demarshalling JSON (or some other wireline format) and you don’t know the -precise- format of that wireline, it helps to have generated deserializers that are identical in every way to the ones you’d use if you -did- have perfect knowledge, but that produce useful and very precise error messages when they fail. That way you can iteratively arrive at the format of the wireline via a process of debugging.]

A while back, I suggested in an issue that it might be nice to have versions of sexp and yojson that carried locations from their source. Something like Ploc.t in:

(* sexp, with locations *)
type t = Atom of Ploc.t * string | List of Ploc.t * t list

(* yojson, with locations (a separate module, hence the second t) *)
type _t =
  [ `Assoc of (string * t) list
  | `Bool of bool
  | `Float of float
  | `Int of int
  | `Intlit of string
  | `List of t list
  | `Null
  | `String of string ]
and t = Ploc.t * _t
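
A minimal sketch of how the located JSON type relates to the plain one, assuming a hypothetical `loc` record as a stand-in for Ploc.t (the real Ploc API is not reproduced here). The names `loc`, `node`, `json`, `plain` and `strip` are illustrative and not taken from any library. The point is only that dropping the locations gives back the ordinary Yojson-shaped value, so nothing is lost by carrying them.

```ocaml
(* Hypothetical stand-in for Ploc.t: file, line, and a character range. *)
type loc = { file : string; line : int; cstart : int; cend : int }

(* Same shape as the located JSON type, with [loc] in place of Ploc.t. *)
type node =
  [ `Assoc of (string * json) list
  | `Bool of bool
  | `Float of float
  | `Int of int
  | `Intlit of string
  | `List of json list
  | `Null
  | `String of string ]
and json = loc * node

(* The familiar location-free shape. *)
type plain =
  [ `Assoc of (string * plain) list
  | `Bool of bool
  | `Float of float
  | `Int of int
  | `Intlit of string
  | `List of plain list
  | `Null
  | `String of string ]

(* Erase every location, recursively. *)
let rec strip ((_, n) : json) : plain =
  match n with
  | `Assoc kvs -> `Assoc (List.map (fun (k, v) -> (k, strip v)) kvs)
  | `List vs -> `List (List.map strip vs)
  | (`Bool _ | `Float _ | `Int _ | `Intlit _ | `Null | `String _) as leaf -> leaf
```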

Recently I’ve been doing a bit of “reverse-engineering” and needed this, so I implemented it. Here’s the story:

(1) Suppose I have some Python code (a LOT of it, which I don’t understand well) and want to port it to OCaml.

(2) One way to do that would be to instrument the Python code to produce a -trace- that can be parsed by OCaml.

(3) Then write OCaml code to execute that trace using data-structures similar to those from Python.

The key advantage of this is that you don’t have to figure out how the Python code works a priori, figure out all the clevernesses, the corner-cases, etc. Instead, if you have a sufficiently large (read: “enormous”) test-suite, you can use that to learn the code as you port. Just get the test-suite producing that -trace-, and your OCaml code parsing it. Then start implementing, bit-by-bit, your OCaml work-alike. Heck, you can have the OCaml code produce the same log, so you can have a further logfile comparison, too!
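
The logfile comparison can be sketched roughly like this, assuming each trace has already been read into a list of lines. The function name `first_divergence` is made up for illustration; a real comparison would probably work on parsed JSON rather than raw strings.

```ocaml
(* Return the first (1-based) position at which two traces disagree, if any.
   The result carries the line from each side; [None] means that side ran out. *)
let first_divergence (expected : string list) (actual : string list) =
  let rec go i e a =
    match e, a with
    | [], [] -> None
    | x :: e', y :: a' when String.equal x y -> go (i + 1) e' a'
    | x :: _, y :: _ -> Some (i, Some x, Some y)
    | x :: _, [] -> Some (i, Some x, None)
    | [], y :: _ -> Some (i, None, Some y)
  in
  go 1 expected actual
```

Run over the Python-produced log and the OCaml-produced log, this points at the first step where the port stops behaving like the original.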

There’s a concept from PL for this methodology: “bisimulation”. It replaces intelligence with massive test-coverage PLUS intrusive instrumentation/logging.

-A- problem with this approach is that for any significant Python code, the data-structures will be complex, so the serializer-to-log and the OCaml deserializer-from-log will be similarly complex. But haha, it’s really, really easy to dump JSON from Python, and similarly easy to parse JSON from OCaml.

Problem: JSON is an ugly data type, and you wouldn’t want to write the functions to convert that JSON back into OCaml data structures by hand. But lucky you, you have PPX derivers that can do it for you!

So you write an OCaml type that looks like your Python data-structure’s outputted JSON, and try it out – try to demarshal bits of your Python-generated log. And BOOM! you get errors during deserialization. And they’re not very helpful, b/c your Python-generated JSON blobs are -big-. Somewhere inside there, something blew up. So much fun! So much winning! I’m tired already of the winning!

If I use a Yojson deserializer[1], I get:

error Mimick.json_log_t

and to be fair, it can’t go any further, really: it doesn’t have any -location- information from the original JSON it could adduce to tell me where to look! But if there -were- location information, then the generated deserializer (which is in every other respect identical to the generated Yojson deserializer mentioned above) will output:

File "_build/CompositeLexers/LexerDelegatorInvokesDelegateRule/json.log", line 6, characters 19-217:
: error Mimick.prediction_context_t

which leads me much closer to the JSON that wasn’t deserializable. It isn’t perfect, but it’s so, so much easier!
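
To make the idea concrete, here is a hand-written sketch of the kind of code a deriver might generate when the JSON carries locations. Everything in it is an assumption for illustration: `loc` is a stand-in for Ploc.t, the JSON type is cut down to three constructors, and `token`, `Deser_error`, `report` and friends are invented names, not what pa_ppx actually emits. The only change from a location-free deserializer is that each failure raises with the location of the offending node.

```ocaml
(* Hypothetical stand-in for Ploc.t. *)
type loc = { file : string; line : int; cstart : int; cend : int }

(* Reduced located JSON: just enough constructors for the example. *)
type node = [ `Assoc of (string * json) list | `Int of int | `String of string ]
and json = loc * node

(* Failure carries the location of the node that did not fit, plus the
   name of the type we were trying to build. *)
exception Deser_error of loc * string

let string_of_loc l =
  Printf.sprintf "File \"%s\", line %d, characters %d-%d" l.file l.line l.cstart l.cend

(* The record we hope the log contains (made up for this example). *)
type token = { kind : string; index : int }

let field loc name kvs =
  match List.assoc_opt name kvs with
  | Some v -> v
  | None -> raise (Deser_error (loc, "token (missing field " ^ name ^ ")"))

let string_of_json : json -> string = function
  | _, `String s -> s
  | loc, _ -> raise (Deser_error (loc, "string"))

let int_of_json : json -> int = function
  | _, `Int i -> i
  | loc, _ -> raise (Deser_error (loc, "int"))

let token_of_json : json -> token = function
  | loc, `Assoc kvs ->
      { kind = string_of_json (field loc "kind" kvs);
        index = int_of_json (field loc "index" kvs) }
  | loc, _ -> raise (Deser_error (loc, "token"))

(* Run a deserializer and turn a failure into a compiler-style message. *)
let report f j =
  match f j with
  | v -> Ok v
  | exception Deser_error (loc, what) ->
      Error (Printf.sprintf "%s:\n: error %s" (string_of_loc loc) what)
```

Because the location comes from the innermost node that failed, the message points at the bad field rather than at the whole blob.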

[1] I should note that the “Yojson deserializer” I mention above is one generated by my pa_ppx code, not the official ppx_deriving_yojson. It’s possible that the official deriver generates better error messages. But even so, when you’re processing a megabyte of JSON logs, spread across 300+ files, you want error messages that are as precise as possible, and that means -locations-.

FWIW jsont has pretty good error reporting abilities.

Daniel,

I am -sorely- tempted to write a PPX deriver that uses jsont as a target for deserialization. After I get done this port …

In fact I think @vds already went there.