Sharing Menhir boilerplate between parsers with different token sets in different dune libraries


I’m using Menhir to parse DIY/Herdtools style litmus tests. These tend to describe multi-threaded programs in some concrete language (C, assembly, etc) and consist of some representation of those programs alongside some auxiliary information (name, initial variable values, final state postcondition, etc). The auxiliary information depends on the concrete language’s representation of integer constants.

Currently, I have a single flat parser for parsing litmus tests in C. The toplevel of said parser looks mostly like this, in new-Menhir syntax:

let top := ~ = ID; ~ = ID; ~ = declaration+; EOF; <(* ... *)>
let declaration :=
  | ~ = initialiser;              <(* ... *)>
  | ~ = postcondition;            <(* ... *)>
  | ~ = locations;                <(* ... *)>
  | ~ = (* c translation unit *); <(* ... *)>

let init_stm      := ~ = ID; EQ; ~ = (* C constant *); <(* ... *)>
let initialiser   := braced(list(endsemi(init_stm)))
let locations     := LIT_LOCATIONS; bracketed(slist(identifier))
let quantifier    := LIT_EXISTS; <(* ... *)>
let disjunct      := conjunct | l = disjunct; LIT_OR; r = conjunct; <(* ... *)>
let conjunct      := equality | l = conjunct; LIT_AND; r = equality; <(* ... *)>
let equality      := ~ = parened(disjunct); <(* ... *)>  | ~ = identifier; EQ_OP; ~ = (* C constant *); <(* ... *)>
let identifier    := ~ = ID; <(* ... *)> | ~ = INT_LIT; COLON; ~ = ID; <(* ... *)>
let postcondition := ~ = quantifier; ~ = parened(disjunct); <(* ... *)>

let slist(x)   == separated_list(SEMI, x)
let braced(x)  == delimited(LBRACE, x, RBRACE)
let parened(x) == delimited(LPAREN, x, RPAREN)
let endsemi(x) == terminated(x, SEMI)

As is probably obvious, this depends on many (mostly language-neutral) tokens (SEMI, LBRACE, etc), and also on two parts of the C parser (translation units and constants).
Everything else would generalise to other languages (x86 assembly, etc.), so ideally I’d like the above parser toplevel to live in a ‘common’ dune library, with the various languages merging it into their own parsers. Each language’s tokeniser (which I’m implementing in sedlex) would supply the tokens referenced above.

I’m running into several apparent issues when trying to do this in practice:

  • It seems I can’t use Dune’s merge_into, as the parsers would straddle library boundaries.
  • I can’t simply pull this bit of the grammar out and expose it as a regular parser in the ‘common’ library: Dune will then try to build it as a full parser and complain about the missing tokens and productions.
  • I could make the parser rules parametric over the C productions, but then the parser would still be missing the tokens.
  • I can’t find an appealing way to make the parser parametric over the tokens without promoting them to productions and parameterising the whole set of productions above over them. There are a lot of tokens used above, and so the parameter lists would become very unwieldy.
  • Ideally, I’d use Menhir’s parameter support to make the parser a functor over the tokens (or, rather, a partial view of them), something like:
    %parameter<Tokens : sig
      type token
      val LBRACE : token
      val RBRACE : token
      (* ... *)
    end>
    then pass the productions through in a parametric way (or somehow also supply them in the parameter). This doesn’t seem to work: as far as I can tell, Menhir looks for tokens (and productions?) in a separate namespace, and even then it probably wants a more concrete representation than the above. I can’t give it the whole token set, as it differs from language to language.
  • Even if I work around all this, I’ll still hit the same problem: I’m not sure how (or whether it’s possible) to reference a library of productions in one Dune library from a parser in another.
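(For what it’s worth, my understanding is that %parameter does make the generated parser an OCaml functor over the given module, but the parameter can only supply types and values used in semantic actions and declarations; tokens still have to be declared with %token in the grammar itself, which would explain why the attempt above fails. A minimal sketch of the part that does work — the module Sem and its contents here are invented for illustration:

```
%parameter<Sem : sig
  type const
  val mk_int : int -> const
end>

%token <int> INT_LIT
%token EOF
%start <Sem.const> main

%%

main:
  | i = INT_LIT; EOF { Sem.mk_int i }
```

The resulting parser module is a functor taking Sem, so the semantic values can vary per language, but the token alphabet is still fixed at grammar-compilation time.)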

Is there some less painful way to do this than duplicating the whole litmus boilerplate for each language, or is this actually a hard limitation of the system?

~ Matt

I’m not really an expert in Menhir, but would it help to share the common parts at the source level, i.e. by copying a common.mly file into the various libraries? Something like this:

(rule (copy# %{lib:common:common.mly} common.mly))

The %{lib:common:common.mly} refers to the file common.mly from the library common, and copy# means to copy the file and add a line directive so that errors are reported against the original file.
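If that works, each language library could then combine the copied file with its own grammar fragments through the menhir stanza’s merge_into. A sketch of such a dune file — the module names c_tokens and c_parser are invented for illustration, and common.mly would need to be installed by the common library for %{lib:...} to resolve:

```
; copy the shared rules next to this library's own .mly files
(rule (copy# %{lib:common:common.mly} common.mly))

; have Menhir join common.mly with the language-specific parts
; (c_tokens.mly would hold this language's %token declarations)
(menhir
 (merge_into c_parser)
 (modules common c_tokens))
```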
