Hi,
I’m using Menhir to parse DIY/Herdtools style litmus tests. These tend to describe multi-threaded programs in some concrete language (C, assembly, etc) and consist of some representation of those programs alongside some auxiliary information (name, initial variable values, final state postcondition, etc). The auxiliary information depends on the concrete language’s representation of integer constants.
Currently, I have a single flat parser for parsing litmus tests in C. The toplevel of said parser looks mostly like this, in new-Menhir syntax:
let top := ~ = ID; ~ = ID; ~ = declaration+; EOF; <(* ... *)>
let declaration :=
| ~ = initialiser; <(* ... *)>
| ~ = postcondition; <(* ... *)>
| ~ = locations; <(* ... *)>
| ~ = (* c translation unit *); <(* ... *)>
let init_stm := ~ = ID; EQ; ~ = (* C constant *); <(* ... *)>
let initialiser := braced(list(endsemi(init_stm)))
let locations := LIT_LOCATIONS; bracketed(slist(identifier))
let quantifier := LIT_EXISTS; <(* ... *)>
let disjunct := conjunct | ~ = disjunct; LIT_OR; r = conjunct; <(* ... *)>
let conjunct := equality | ~ = conjunct; LIT_AND; r = equality; <(* ... *)>
let equality := ~ = parened(disjunct); <(* ... *)> | ~ = identifier; EQ_OP; ~ = (* C constant *); <(* ... *)>
let identifier := ~ = ID; <(* ... *)> | ~ = INT_LIT; COLON; ~ = ID; <(* ... *)>
let postcondition := ~ = quantifier; ~ = parened(disjunct); <(* ... *)>
let slist(x) == separated_list(SEMI, x)
let braced(x) == delimited(LBRACE, x, RBRACE)
let parened(x) == delimited(LPAREN, x, RPAREN)
let endsemi(x) == terminated(x, SEMI)
As is probably obvious, this depends on many (mostly language-neutral) tokens (SEMI
, LBRACE
, etc), and also on two parts of the C parser (translation units and constants).
Everything else would generalise to other languages (x86 assembly, etc.), and so, in an I’d like the above parser toplevel to live in a ‘common’ dune library and the various languages to merge it into their own parsers. Each language’s tokeniser (which I’m implementing in sedlex) would supply the tokens referenced above.
I’m running into several apparent issues when trying to do this in practice:
- I can’t seemingly use Dune’s
merge_into
, as the parsers straddle library boundaries. - I can’t immediately pull this bit of the module out and expose it as a regular parser in the ‘common’ library, as then Dune’ll try to build it as a full parser and it’ll complain about missing tokens and productions.
- I could make the parser rules parametric over the C productions, but then the parser would still be missing the tokens.
- I can’t find an appealing way to make the parser parametric over the tokens without promoting them to productions and parameterising the whole set of productions above over them. There are a lot of tokens used above, and so the parameter lists would become very unwieldy.
- Ideally, I’d use Menhir’s
parameter
support to make the parser a functor over the tokens (or, rather, a partial view of them: something like:
then pass through the productions in a parametric way (or somehow also supply them in the parameter). This doesn’t seem to work—as far as I can tell, Menhir is looking for tokens (and productions?) in a separate namespace; and even so, it probably wants a more concrete representation than the above. I can’t give it the whole token set, as it differs from language to language.%parameter<Tokens : sig type token val LBRACE: token val RBRACE: token (* ... *) end>
- Even if I work around all this, I’ll end up at the same problem of not being (sure how to/able to) reference a library of productions in one Dune library from a parser in another one.
Is there some less painful way to do this than duplicating the whole litmus boilerplate for each language, or is this actually a hard limitation of the system?
~ Matt