[ANN] Orsetto: structured data interchange languages (preview release)

I am pleased to announce that I’ve reached the preview milestone I set for my Orsetto project. As I wrote in the README file about it:

Orsetto is a standalone library comprising a core toolkit…

  • Core functional data structures and processes.
  • Unicode transport, normalization, parsing and formatting.
  • General purpose packet format encoder-decoder processes.

…used to implement streaming parsers and formatters for a useful variety of
structured data interchange languages…

In the preview release, the only major featured languages are JSON and CBOR, but my hope is to expand this list to include a variety of other useful languages. The programming interfaces are sufficiently different from other implementations that I feel Orsetto may be a welcome alternative to have available.

Orsetto is currently available from my personal OPAM repository, which you can use in the conventional way:

opam repository add jhwoodyatt git+https://bitbucket.org/jhw/opam-personal.git

In two weeks, unless discussion here convinces me to delay or defer, I will request to make Orsetto available on the public OPAM repository, along with a commitment to make patch releases as necessary to correct errors.

At this time, I’m inviting the OCaml community to give it a look, post comments and questions about it here, and file issues on the issue tracker if you notice anything wrong. I’m especially interested in knowing about name conflicts that I need to avoid. Once I push to the public OPAM repository, I want to be able to move quickly toward its first stable release.

—james

Some news.

  • I pushed another preview revision to my personal OPAM repository. It just upgrades from Unicode 11 to Unicode 12.
  • I’ve been stalling on releasing 1.0 to the community OPAM repository because I’ve been waiting to see how much I would need to do to support OCaml 4.08. Sadly, 4.08+beta2 is incompatible with ppx_tools and ppx_migrate_parsetools, on which Orsetto has a dependency. I’m trying to decide whether to drop the dependency or wait a little while longer. The more I look at the PPX world, the less robust it looks, and I’m strongly leaning toward dropping the dependency.

I think if you can afford it you should do it. Somehow ppx_migrate_parsetools has to follow the evolution of the ast so there will always be a bit of lag on new OCaml releases and/or it will often break if you want to test trunk.

It’s not just ppx_migrate_parsetools. There is also the split between ppx_tools and ppx_tools_versioned, which appear to me as competing forks of the same library, both actively under separate maintenance and each used extensively by the community. This is also a problem for the multicore port, which also has new syntax that PPX needs to know about. It seems like everything about PPX smells like “experimental” and it’s weird that so much of the OPAM directory is dependent on these tools that are so tightly coupled to the abstract syntax.

I have released ~preview3, which improves compatibility with OCaml 4.08+beta2, drops the dependency on ppx_deriving, and adds a dependency on stdlib-shims, which I hope will maintain compatibility with the main compiler beta packages more closely than the PPX world seems to be tracking them.

I have now released ~preview4, which resolves Issue #8, “OCaml 4.07: the new Stdlib.Seq.t is functionally equivalent to Cf_seq.t.” For OCaml 4.06, this introduces an external dependency on the seq compatibility package. I’ve also checked that documentary comments are available with odig, so this might be the last preview release before 1.0. (It depends on whether I decide to remove the support for the ppx_let syntax extension.)
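
For readers unfamiliar with the type named in Issue #8: Stdlib.Seq.t, added to the standard library in OCaml 4.07, is a delayed list, that is, a function from unit to either Nil or a Cons of one element and the rest of the sequence. The sketch below uses only the standard library. The names range and sum_of_squares are hypothetical, and nothing here is taken from Orsetto's Cf_seq module, whose interface is not shown in this thread.

```ocaml
(* Hypothetical helper: a delayed sequence of the integers lo..hi inclusive.
   Nothing is computed until a consumer forces each cell. *)
let rec range lo hi : int Seq.t = fun () ->
  if lo > hi then Seq.Nil else Seq.Cons (lo, range (lo + 1) hi)

(* Map and fold over the sequence; only Stdlib.Seq functions
   available since OCaml 4.07 are used. *)
let sum_of_squares =
  range 1 4
  |> Seq.map (fun x -> x * x)
  |> Seq.fold_left ( + ) 0
```

On OCaml 4.06, the same code should compile against the seq compatibility package mentioned above, which provides the Seq module for older compilers.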

It depends on whether I decide to remove the support for the ppx_let syntax extension.

I’ve thought about this, and I will not be removing support for the ppx_let syntax extension. I plan to deprecate it when OCaml 4.08 is released, but it will be retained while I continue supporting OCaml 4.06 and 4.07.
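
Some context on why OCaml 4.08 is the natural cutoff, offered as an assumption rather than something stated in the post: 4.08 adds user-definable binding operators such as let* and let+, which cover much of what ppx_let's let%bind and let%map provide, without a preprocessor. A minimal sketch for the option type follows. The operator definitions and the add_opt helper are hypothetical illustrations, not part of Orsetto's API.

```ocaml
(* Requires OCaml 4.08 or later. Hypothetical operators for option. *)

(* Monadic bind: continue only when a value is present. *)
let ( let* ) opt f =
  match opt with
  | Some x -> f x
  | None -> None

(* Functorial map: transform the value when present. *)
let ( let+ ) opt f =
  match opt with
  | Some x -> Some (f x)
  | None -> None

(* Hypothetical helper: adds two optional integers.
   With ppx_let this would use let%bind and let%map instead. *)
let add_opt a b =
  let* x = a in
  let+ y = b in
  x + y
```

Here add_opt (Some 1) (Some 2) evaluates to Some 3, and any None argument yields None.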

Hi, have you thought about Thrift & protobufs support? I mention b/c … well, as a systems-builder, whenever I reach for a distributed system, I’m also reaching for Thrift, b/c inevitably I need to support some client/server written in C++. [Of course, they’re also designed for performance] Just a thought …

p1. I’ve not given a lot of thought to Thrift because of how its RPC semantics are so tightly coupled with its specification. My attention has mainly been focused on structured data interchange languages.

p2. I’m still thinking out how to deal with structured data modeling languages that are tightly coupled to their corresponding interchange languages, e.g. ASN.1 and BER/DER/xER; Google Protocol Buffers; YANG and NETCONF; CDDL and CBOR; et cetera.

You may want to follow issues #37 and #38, which are about Google Protocol Buffers and generic structured data modeling respectively.

I agree with you about p2: typically your choice of IDL is tantamount to picking your wire-format. But p1? Must disagree: both protobufs and thrift can be and are used for encoding data-at-rest. E.g. it is documented that at Google, a number of data-management systems (e.g. Dremel) store rows in Protobuf format. It’s actually a really great thing, b/c
(1) the type language is not-so-impoverished
(2) you get a decent set of tools for dealing with it,
(3) cross-language interoperability (insanely important for any nontrivial system: languages come and go; the -data-, that lasts a LONG time)
(4) you can use the (limited) version-to-version compatibility built into protobufs/thrift to (within limits) evolve your data-type-definitions.

I felt I should add that protobufs support two wirelines: the binary one, and a text-mode format, that is …lovely for writing config files. So you never again define config-file formats. I liked this so much, I implemented human-readable JSON wireline encoding for thrift (which has several wirelines, but they’re all pretty binary), so I could do this very thing. And again, because it’s a cross-language IDL/format, you can use those config-files from any supported language.

I cannot begin to properly kvell over the utility of this simple idea (“use a modern IDL for describing your config-files”).

I agree those are points in its favor, and I’m thinking about it seriously.

One reason I stayed away from it during initial development is that I was working for Google at the time and my employment agreement prohibited me from working on hobby projects that might interfere with Google objectives. I felt it was too risky for me personally to work on an OCaml implementation of Protocol Buffers given the statements of active disinterest by Google representatives on the Protocol Buffers discussion lists about OCaml support.

Since I’ve now left Google, I feel a lot more free to pursue that direction. I’m aware of the alternative textual format for Protocol Buffers messages. I’ve seen it used as a configuration file format. Its tight coupling to the modeling language is both a blessing and a curse: one place it really breaks down is when you have any sort of partial data expansion, e.g. data templates.

I’m not arguing against adding it to Orsetto, just explaining why it isn’t there now, and what I view as some of the complications. I feel like I need to explore the idea of a generalized data modeling abstraction further before I know how to get Protocol Buffers done right. (Also, I’m kinda not looking forward to the task of writing a protoc plug-in.)

FWIW, I had my own run-ins with the GRPC folks (from outside Google). It became clear to me that unless you’re building software to talk TO Google’s systems, the GRPC folks aren’t interested in any problems you might encounter, nor in accepting any changes you propose. It’s not designed for use as a general-purpose RPC system, nosirree. I mean, the simplest thing (“I want to do my own socket/listen/accept loop, and hand GRPC connected sockets to run RPCs on”) is pretty much impossible without doing serious surgery on their code. Whereas in Thrift, it’s trivial (b/c Thrift was organized to make it straightforward).

For my money, Thrift is the way to go.