Using Menhir to parse into idiomatic JS (TypeScript) structures

Hello and Happy New Year!

As we all know, despite being named an Objective Categorical Abstract Machine Language, OCaml is Obviously a Completely Awesome Meta Language. The awesomeness is founded on the wonderful language, but obviously completed by the extraordinary ecosystem of tools and techniques that take life in that language. :stuck_out_tongue_winking_eye:

This is a note to share a solution I hacked together using a handful of these awesome tools in our lovely language, but also a request for any suggestions on a more elegant solution to the problem posed.

The Problem

I wanted to use Menhir and Sedlex to write a fault-tolerant, incremental parser for a preexisting project that has an intermediate representation (IR), and a bunch of other tools, written in TypeScript.


Unfortunately, achieving this outcome was not quite as simple as adding a (modes js) to dune to have the generated parser compiled by Js_of_ocaml (Jsoo). Of course, that does work like a charm, and if we could justify rewriting everything in OCaml, we’d be able to produce JS easy as pie. But for interop with the existing TypeScript code, this won’t fly. We need to parse into JS objects that represent the IR in a human-readable way, ideally matching our existing TypeScript types, but Jsoo gives something like this:

> parser.parse('def foo(a,b) = 123')
[ 0, [ 0, [ 0, [Array], [Array] ], 0 ] ]
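
To make the mismatch concrete, here is a minimal TypeScript sketch of what decoding such a value by hand could look like. It assumes the usual Jsoo block encoding, where a record or constructor becomes an array whose first element is the tag, and an OCaml list is either `0` (empty) or `[0, head, tail]`. The `Def` shape and the `decodeDef` and `decodeList` helpers are hypothetical and only for illustration; strings are shown as plain JS strings for simplicity.

```typescript
// Hypothetical IR shape we would like to receive on the TypeScript side.
interface Def {
  name: string
  params: string[]
  body: number
}

// Assumed raw encoding: a block is [tag, field1, field2, ...];
// an OCaml list is either 0 (empty) or [0, head, tail].
type RawList = 0 | [0, string, RawList]
type RawDef = [0, string, RawList, number]

// Walk the cons cells and collect the elements into a JS array.
function decodeList(raw: RawList): string[] {
  const out: string[] = []
  let cell: RawList = raw
  while (cell !== 0) {
    out.push(cell[1])
    cell = cell[2]
  }
  return out
}

// Drop the tag and name the positional fields.
function decodeDef(raw: RawDef): Def {
  const [, name, params, body] = raw
  return { name, params: decodeList(params), body }
}

const raw: RawDef = [0, 'foo', [0, 'a', [0, 'b', 0]], 123]
const decoded = decodeDef(raw)
```

A decoder like this has to be kept in sync with the field order of every OCaml type by hand, which is the kind of maintenance burden I wanted to avoid.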

My next thought was to use Melange: it seems to be a great project with a lot of excellent work going into it, and, iiuc, its purpose is precisely to compile OCaml into idiomatic(ish) JS. But I hit a roadblock right away, which led me to ask “What are the limits and prerequisites of using dependencies with melange?” and to try something else.

A Solution

Fortunately, the wealth of shining jewels-of-tools in the OCaml ecosystem made this short work. The solution I ended up with is hacky as heck, but it's doing what I needed:

  • I define our types using the excellent atd.

  • I generate OCaml and TypeScript representations of the types, along with JSON serializers, via a dune config like

    (library
     (public_name lang_ir)
     (libraries atdgen))

    ;; The OCaml ser/de-serializers
    (rule
     (targets lang_ir_j.ml lang_ir_j.mli)
     (deps    lang_ir.atd)
     (action  (run atdgen -j -j-std %{deps})))

    ;; The OCaml types
    (rule
     (targets lang_ir_t.ml lang_ir_t.mli)
     (deps    lang_ir.atd)
     (action  (run atdgen -t %{deps})))

    ;; The TypeScript types and ser/de
    (rule
     (targets lang_ir.ts)
     (deps    lang_ir.atd)
     (action  (run atdts %{deps})))

    ;; Conversion of the TypeScript into vanilla JS so I can test it with node
    (rule
     (targets lang_ir.js)
     (deps    lang_ir.ts)
     (action  (run npx tsc %{deps})))
  • I use menhir and sedlex to define a parser that produces inhabitants of the OCaml types generated in the previous step. (Working out the incremental, fault-tolerant parsing was its own exhilarating side quest, but I’ll save a report on that for its own post.)

  • Then I use Jsoo to run the parser in JS and serialize its optimized but inscrutable representation into the JSON dictated by atd:

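      open Lang_parser_lib
      open Js_of_ocaml

      (* Export functions *)
      (* See *)
      let _ =
        (* Expose the object's methods as the exports of the compiled JS module *)
        Js.export_all
          (object%js
             method parse s =
               (* [s] arrives as a JS string, so convert it before lexing *)
               let lexbuf = Sedlexing.Utf8.from_string (Js.to_string s) in
               match parse lexbuf with (* run the parser *)
               (* produce a JSON string from the result, via the module atdgen derives from lang_ir.atd *)
               | Some t -> Js.string (Lang_ir.Lang_ir_j.string_of_t t)
               | None -> Js.string ""
           end)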
  • And, finally, I make a little wrapper.js that invokes the atd-generated deserializer to parse into the TypeScript representation:

    var ir = require('./_build/default/ir/lang_ir.js')
    var parser = require('./_build/default/js/lang_parser.bc.js')
    exports.parse = function (s) {
      return ir.readT(JSON.parse(parser.parse(s)))
    }
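
One caveat with the steps above: the OCaml side returns an empty string when parsing fails, and `JSON.parse("")` throws a `SyntaxError`. Below is a minimal TypeScript sketch of a wrapper that handles that case. Everything in it is a stand-in, not the real generated code: `rawParse` plays the role of the Jsoo-exported `parser.parse`, `readT` is a hand-written imitation of the kind of reader atdts generates, and the `Def` type is a guessed, simplified IR shape.

```typescript
// Hypothetical, simplified IR type (the real one would come from lang_ir.atd).
interface Def {
  name: string
  params: string[]
}

// Stand-in for the atdts-generated reader: check the shape of untyped
// JSON and return a typed value, throwing on any mismatch.
function readT(x: unknown): Def {
  if (typeof x !== 'object' || x === null) throw new Error('expected an object')
  const o = x as Record<string, unknown>
  if (typeof o.name !== 'string') throw new Error("expected string field 'name'")
  if (!Array.isArray(o.params) || !o.params.every((p) => typeof p === 'string')) {
    throw new Error("expected string[] field 'params'")
  }
  return { name: o.name, params: o.params as string[] }
}

// Stand-in for the Jsoo-compiled parser: JSON text on success, "" on failure.
function rawParse(s: string): string {
  return s === 'def foo(a,b) = 123'
    ? JSON.stringify({ name: 'foo', params: ['a', 'b'] })
    : ''
}

// Typed wrapper: map the "" failure signal to undefined instead of
// letting JSON.parse throw on empty input.
function safeParse(s: string): Def | undefined {
  const json = rawParse(s)
  if (json === '') return undefined
  return readT(JSON.parse(json))
}
```

Returning `undefined` is only one option; throwing a descriptive error, or having the OCaml side serialize an explicit error value, would work as well.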

The result

I can now use the wrapper script to parse into the nice TypeScript-compatible structures I need:

[me@comp mparsing]$ node
> var p = require('./wrapper.js')
> p.parse("def foo(a,b) = 123")
    loc: { start_: [Object], end_: [Object] },
    v: { name: 'foo', params: [Array], body: [Object] }

I have three hopes for this post:

  1. I hope to contribute yet another note celebrating the virtues of our extraordinary programming language ecosystem.
  2. I hope it might be useful for others who need to solve similar problems.
  3. I hope there is a more elegant way to achieve this result (namely, without having to go through serialization) and that one of y’all can point the way.

:heart: :camel: