Pretty printer for custom data types best practices?

A tactic I’ve been recently using with some success for this problem is to leverage the work I’ve already done for this sort of thing in Orsetto.

Orsetto has its own extensible runtime type identification system (given the lack of one in Stdlib), which is available for use in both its JSON and CBOR parsing and formatting libraries. It is, for example, possible to parse an opaque JSON value from an input stream and format that value as CBOR for output, without ever unpacking the structure of the value.

This is useful because it allows me to write one function for any given algebraic type that converts it to an opaque value that I can then serialize as either JSON text, a CBOR binary message, or even the textual CBOR diagnostic notation (an extension of JSON) as my needs arise.
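For concreteness, here is a minimal sketch of that shape, with hypothetical types and names rather than Orsetto's actual API: a small intermediate value type, one conversion function per application type, and one renderer per interchange format walking the same structure.

```ocaml
(* Hypothetical sketch, not Orsetto's API: a tiny intermediate value
   type shared by every renderer. *)
type value =
  | Null
  | Bool of bool
  | Int of int
  | Float of float
  | String of string
  | List of value list
  | Assoc of (string * value) list

type point = { x : float; y : float; label : string }

(* One conversion function per algebraic type ... *)
let value_of_point p =
  Assoc [ "x", Float p.x; "y", Float p.y; "label", String p.label ]

(* ... and one renderer per format. Only a crude JSON renderer is shown
   (OCaml-style escaping stands in for proper JSON escaping); a CBOR or
   diagnostic-notation renderer would walk [value] the same way. *)
let rec to_json = function
  | Null -> "null"
  | Bool b -> string_of_bool b
  | Int i -> string_of_int i
  | Float f -> Printf.sprintf "%g" f
  | String s -> Printf.sprintf "%S" s
  | List vs -> "[" ^ String.concat "," (List.map to_json vs) ^ "]"
  | Assoc kvs ->
      "{"
      ^ String.concat ","
          (List.map (fun (k, v) -> Printf.sprintf "%S:%s" k (to_json v)) kvs)
      ^ "}"
```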

One of the things I’m hoping to get around to writing, once the Ppx element of the OCaml platform matures further, is a deriving layer that derives these conversion functions automatically. Separating application data structure types from the structured data interchange language used to serialize them, by introducing a common intermediate representation, seems like good design to me.

I kind of wish there were an extensible runtime type annotation system in the OCaml standard library. I’m not so full of myself as to suggest that Orsetto’s Cf_type and related logic should be systematically adopted by the OCaml core, but if something like it were to land in Stdlib at the direction of somebody with better API design skills than me, I would prefer to adopt it in Orsetto rather than continue to maintain my own crazy alternative.

While sexp is simple, JSON is only marginally more complicated but has a much stronger ecosystem: tools like jq are very useful. Databases now support JSON as a base type. My vote therefore would be for JSON as a uniform serialisation format.


I’m not going to disagree that JSON has a much stronger ecosystem, but I just wanted to advertise that sexps have a jq-like thing too: GitHub - janestreet/sexp: S-expression swiss knife :slight_smile:


Printing has overlapping but distinct use cases from serialization. Not all printers should be readable back into data: there could be hidden fields, opaque types, file descriptors, etc. I personally find sexprs more readable than JSON values (if only because there’s far less quoting in general), so they’re more amenable to being good printers.
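As a small illustration (names here are made up), a printer for a value that could never round-trip through a serializer anyway, because one of its fields has no meaningful textual form:

```ocaml
type connection = {
  host : string;
  port : int;
  socket : Unix.file_descr;  (* opaque: nothing sensible to print *)
}

(* The printer deliberately elides the descriptor rather than trying
   to make it readable back into data. *)
let pp_connection ppf c =
  Format.fprintf ppf "@[<hv 2>{ host = %S;@ port = %d;@ socket = <fd> }@]"
    c.host c.port

(* Usage: Format.printf "%a@." pp_connection conn *)
```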

If we’re talking about standard serialization, I think a mechanism independent of the concrete format (like rust’s serde) would be better anyway.


Even serde uses serialisation to JSON as an example on its home page. Internally using a more orthogonal format than JSON is of course fine, but strong JSON support is simply driven by the power of its ecosystem. Exchange with the outside world is more important than having the cleverest serialisation format that you can only use within OCaml.


As I wrote elsewhere, the point is not JSON vs s-expression vs whatever format; 15 years ago everyone would have advocated for XML support (see what happened to it in Scala).

The point is upstream provided infrastructure for helping with the serialisation problem, regardless of the format you need to target.


Of course, JSON is unavoidable these days. I just wanted to point out that serde separates the (de)serializable instances (defined only once) from their use with a particular serialization format. There are many libraries in opam that try to do that (including one I wrote a long time ago), but these are less useful if not standardized.
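A rough OCaml sketch of that serde-style split, with invented names: the description of how to serialize a value is written once against an abstract format, and each concrete format is plugged in separately.

```ocaml
(* Abstract view of an output format. *)
module type FORMAT = sig
  type out
  val string : string -> out
  val int : int -> out
  val field_list : (string * out) list -> out
end

(* Written once per application type, independent of any format. *)
let serialize_user (type o) (module F : FORMAT with type out = o)
    ~(name : string) ~(age : int) : o =
  F.field_list [ "name", F.string name; "age", F.int age ]

(* One possible backend: a crude JSON-ish text format. *)
module Json_text : FORMAT with type out = string = struct
  type out = string
  let string s = Printf.sprintf "%S" s
  let int = string_of_int
  let field_list kvs =
    "{"
    ^ String.concat ", "
        (List.map (fun (k, v) -> Printf.sprintf "%S: %s" k v) kvs)
    ^ "}"
end

let () =
  print_endline (serialize_user (module Json_text) ~name:"Ada" ~age:36)
```

A binary backend would implement the same `FORMAT` signature without the per-type code changing, which is the standardization point being made here.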


I’d like to add, in agreement with what @dbuenzli wrote in that other thread:

I think it would be much more worthwhile for the community if the stdlib provided infrastructure for the serialization problem rather than provide a specific serialization format.

Earlier in that thread, he surfaced a need that arises in every structured data interchange language: meaningful signaling of decoding errors, with accurate text and data locations. That metadata is complex enough that one often wants to represent it in a structured data interchange language as well. Structured metadata.
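As a rough illustration (hypothetical types, not any library's actual API), the kind of location-carrying error metadata in question might look like this, and it is itself the sort of structured data one ends up wanting to interchange:

```ocaml
type position = { line : int; column : int }

type decode_error = {
  start_pos : position;
  end_pos : position;
  expected : string;   (* e.g. "object key" *)
  found : string;      (* e.g. "end of input" *)
}

let pp_decode_error ppf e =
  Format.fprintf ppf "%d:%d-%d:%d: expected %s, found %s"
    e.start_pos.line e.start_pos.column
    e.end_pos.line e.end_pos.column
    e.expected e.found
```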

I’m currently working on this exact problem in Orsetto. The approach I’m taking with it is only really possible when the concrete interchange language is an abstraction from the structure of the data.

Shorter james: concur with @dbuenzli. It would be better if Stdlib contained more infrastructure for structured data interchange than any new concrete interchange languages.


FYI: https://conjury.atlassian.net/browse/ORS-79 asks for a login just to see the content. Do you have a publicly viewable version?

I’m afraid this thread has diverged from "how to pretty-print custom data types" to another run of will-never-happen proposals for extending the language with some sort of deriving/runtime-type support. Sad.


I think there’s another aspect that might be nice to make happen. There are a lot of type-deriver-based pretty-printers out there, and they have varying levels of support for the various OCaml types. If all the authors could get together and agree on one set of types that they all must support, and even better, a common standard benchmark that they all could run, that’d be most excellent.

Most obvious example I can think of: there is varying support for extensible variants. I think maybe there’s varying support for polymorphic variants, but not sure. Same, I kind of remember that there’s varying support for all the various renamings of primitive types (“int” vs “Stdlib.int” vs “Int.t”).
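As a sketch of the kind of type corpus such a shared test suite might standardize on (deriving annotations omitted, since which derivers accept each case is exactly what varies):

```ocaml
(* Renamed / aliased primitives. *)
type plain_int = int
type aliased_int = Stdlib.Int.t

(* Polymorphic variants. *)
type color = [ `Red | `Rgb of int * int * int ]

(* Extensible variants: the open type and a later extension. *)
type event = ..
type event += Tick of int
```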

I’d be happy to contribute to such an effort, and to do my part for the derivers I’ve written.

This isn’t as simple as it looks: there’ll be lots of cases that don’t work, and will require #if conditionalization. But hopefully it would result in (over time) something really standardized for pretty-printing support, no matter which output format you want.

By the way, it might be useful in this regard if there were a way to use the Format module without the “@[” and other notations being executed – it would make checking equality assertions about the output of pretty-printers a lot easier. Since I use Format via “Fmt”, I guess I should look into whether it’s possible to do that down in Fmt.
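One workaround that comes to mind, sketched here with plain Format rather than Fmt: render into a buffer with an effectively unlimited margin, so break hints never turn into newlines (hard newlines such as “@\n” still come through). That makes string-equality assertions on printer output much more predictable.

```ocaml
(* Render [v] with printer [pp] on a fresh formatter whose margin is
   effectively infinite, so "@ " break hints print as single spaces
   and boxes never wrap. *)
let render_flat pp v =
  let buf = Buffer.create 256 in
  let ppf = Format.formatter_of_buffer buf in
  Format.pp_set_margin ppf max_int;
  pp ppf v;
  Format.pp_print_flush ppf ();
  Buffer.contents buf
```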

[I’m not expressing a -strong- opinion here, and could be convinced otherwise, so please don’t take this as some sort of vigorous pushback]

Two points in this note.

Why not invest more in macro-preprocessor support?

I don’t know if “generic programming” is the answer here. I went and read a good bit of the “spiny encoding” paper linked off the TPF repository (mentioned in @pqwy’s comment just a little below yours) and also their example of “Sexps hands-free” and … it’s not so convincing.

It seems like generic programming is great, just great, for those who are really, really used to using a massive tower of type theory, but pretty impenetrable to those of us who want to use only a good bit of type theory, but not a massive tower?

I’d contrast that with macro-preprocessing, which (it seems to me) is far more approachable for people who aren’t committed to reading Lambek & Scott (Introduction to Higher Order Categorical Logic) or some equivalent book (full disclosure: I read and thought I really grokked it, back in 1986; thankfully I’ve been able to recycle that cache memory for more useful things, like “we rate dogs” videos).

It’s pretty impenetrable stuff, and arguing that one ought to learn it, just as one ought to eat one’s Wheaties, is a heavy lift; it restricts the accessibility of this area of code to those who are willing to make the investment.

Instead of investing a lot of energy in using generic programming to make pretty-printing nicer, why not invest that energy in making macro-preprocessing nicer and easier? It’s fundamentally easier to approach for most programmers.

At least for “de/serialization”, there’s reason to think that type-based recursion isn’t enough

There’s been work done in “universal marshallers”: the “Concert” project and its follow-on “Mockingbird” (Josh Auerbach) from the early 90s at IBM Research (back when they did some). And that work showed that in order to produce “tight” mappings between a wireline and a memory-representation (of the sort that would arise from a hand-written IDL compiler) you had to do nontrivial analysis of the IDL type-structures. There were algorithms there that needed graph-isomorphism, if I remember right. Somehow I have a doubt that that’s going to happen using generic programming.

For sure, this doesn’t matter for pretty-printing, since more-or-less the author gets to define the format and the mapping from memory-structures. But for actual serialization, I think it might matter (since it did in the past).

It’s unclear to me whether we are looking at a tower or a bump. I’m not claiming this is the solution, but in any case I see it as the datatype “assembly” for what we are constantly using. It may be worth learning and giving it a try. It may also be just a matter of having better background material.

Personally, from an end-to-end understanding, I find it more penetrable and enlightening than the brittle ppx/camlp45 technology – that (changing) OCaml AST and build churn is pretty wild, isn’t it?

Besides, by virtue of existing in the Meta Language now, it has a higher chance of passing this kind of test – my pockets are not deep enough to throw the money needed at other people to repeatedly migrate my code bases from pre-processing “solutions” to “solutions”.

Now if you were talking about language integrated meta programming facilities, why not. But I’m looking for a reasonable and sustainable solution meanwhile :–)


Two thoughts:

  1. I’m not suggesting that you embrace camlp5, but rather that you make PPX sufficiently approachable that programmers can write macros as part of their packages, as people used to do in the LISP/Scheme world. That most assuredly isn’t the case today, and the evidence is that people don’t write small PPX rewriters as part of their projects, preferring instead to do things “by hand”. Best example that comes to mind is ppx-migrate-parsetree itself.

  2. I lived a long time in the Church of Curry-Howard, so it’s not that I’m unaware of these higher-typed things, dependent typing, etc. But they take -space- in the brain, space that could be taken up by (for instance) transaction-processing knowledge, or detailed knowledge of how to work with RDMA, or lord-knows-what-else. It’s a trade-off that people make, as to what they learn fully, and what they learn only enough to get their work done.

Using higher-order type systems comes at a cost: it’s not free, and it’s not a molehill.


Alas, not yet. Atlassian recently announced that Bitbucket Issues is obsolescent and recommended migration to Jira. So I did, and then discovered afterward that an Atlassian account is required even for viewing Jira issues.

Which came as such a surprise that I’m still not sure I’m configuring everything correctly.

Accounts are available at various paid tiers and a free tier. I haven’t yet figured out whether I need to upgrade to a paid account for my Jira issues to be publicly visible, and if so which tier I need to buy to make it happen. I’ll get that sorted at some point soon.

Update: I needed to sign up for the standard plan to enable anonymous browsing. So I just did that.


It works, nice! :grin: