Your favorite Menhir tricks and fanciness

[full disclosure: I’m not a Menhir fanboi, b/c I prefer LL(1). But I still recognize that it’s quite advanced and well-developed. Lots of stuff there that people don’t commonly know, but should.]

After the thread about hand-written parsers, and the back-and-forth about LALR(1) parser generators and their historical limitations, I thought it might be interesting to see what people find really cool in Menhir, and the cool ways they use it. Or even “stuff I saw people do with Menhir that I was quite impressed by”. I’ve read here and there about Menhir’s grammar-conflict checking, and also its runtime support for error detection and recovery, but don’t actually know much about it.

I’ll go first: Menhir has support for “macros” (the %inline directive, example on page 17 of the manual). This is wicked stuff, and I am very impressed. Here’s the example from that page:

expression:
| i = INT { i }
| e = expression; o = op; f = expression { o e f }

%inline op:
| PLUS { ( + ) }
| TIMES { ( * ) }

The nonterminal op gets inlined into the nonterminal expression in the obvious way. Wicked.
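
To make the effect concrete: after %inline expansion the grammar behaves roughly as if one had written the following by hand (a sketch of the expansion, not Menhir’s literal output):

expression:
| i = INT { i }
| e = expression; PLUS; f = expression { ( + ) e f }
| e = expression; TIMES; f = expression { ( * ) e f }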

4 Likes

I quite like the possibility of writing rules that take parameters: this is what allows Menhir to have a standard library, but you can also define your own:

option(X):
| { None }
| x = X { Some x }

They can also be inlined.
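
For instance, a home-grown parameterized rule might look like this (a sketch; the LPAREN and RPAREN tokens are assumed to exist in your grammar):

%inline parenthesized(X):
| LPAREN; x = X; RPAREN { x }

You can then write parenthesized(expr) anywhere a parenthesized expression is expected.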

1 Like

Menhir standard library! I never knew. Pages 17-18 of the manual, btw. Lovely!

1 Like

Yes, it’s very nice!
Besides the quite obvious use cases covered by the standard library, I have seen that feature used to add locations to anything, this way:

%public %inline located(X):
x = X { Position.with_poss $startpos $endpos x }

After that, in every rule you can use located(expr) if you want an expr with its position attached. Obviously the OCaml code here may vary, which is why this is not part of the stdlib.
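
Usage might then look like this (a hypothetical rule: the expr nonterminal, the SEMICOLON token and the Ast constructor are assumptions, not part of the snippet above):

statement:
| e = located(expr); SEMICOLON { Ast.Expr_statement e }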

4 Likes

My two favourite features of Menhir, when compared to other parser generators, are checkpoints (the incremental interface) and the ability to marshal the intermediate representations (*.cmly files), that is, the grammar and the automata.

By the way, Menhir is LR(1), not just LALR(1), so it accepts slightly more grammars and the conflicts are less ad hoc. Also, my two favourite features of LR are that parsing is linear and that it has the prefix property: the parser will signal an error on the first erroneous token, and the consumed prefix always admits at least one continuation.

Checkpoints & Inspection

The traditional Yacc interface grabs control: you call the parser, passing a lexer as argument. The parser invokes the lexer as needed and either succeeds, giving you the final semantic value, or fails without much information.

The incremental interface of Menhir (available in table mode, by passing the --table flag) inverts this control flow: rather than calling a function that does the whole analysis, the parser continuation is reified as an OCaml value.

A set of functions is provided to operate on these continuations (see the driver sketch after this list):

  • create new parsers: each entry point of the grammar becomes a value of type 'a checkpoint, where 'a is the type of its semantic values.
  • advance a parser by repeatedly calling offer, which takes a checkpoint and a token and can:
    • return a value of type 'a if parsing finished
    • return a continuation if more tokens are needed
    • reject the input if the token cannot be consumed from that checkpoint
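
A rough illustration of such a driver (a minimal sketch: the Parser module, the Lexer.token function and the main entry point are assumptions; offer, resume and the checkpoint constructors are the ones generated by Menhir in --table mode):

module I = Parser.MenhirInterpreter

let parse (lexbuf : Lexing.lexbuf) =
  let rec loop checkpoint =
    match checkpoint with
    | I.InputNeeded _ ->
        (* The parser needs one more token: read it and offer it. *)
        let token = Lexer.token lexbuf in
        let startp = lexbuf.Lexing.lex_start_p
        and endp = lexbuf.Lexing.lex_curr_p in
        loop (I.offer checkpoint (token, startp, endp))
    | I.Shifting _ | I.AboutToReduce _ | I.HandlingError _ ->
        (* Intermediate checkpoints: let the parser make progress. *)
        loop (I.resume checkpoint)
    | I.Accepted value -> value
    | I.Rejected -> failwith "syntax error"
  in
  loop (Parser.Incremental.main lexbuf.Lexing.lex_curr_p)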

The parser state is immutable, so as long as semantic actions have no side effects, you can reuse a previous parser to, say, explore multiple parses. The inspection interface optionally generated by Menhir (--inspection) lets you look at the contents of a parser checkpoint (current state, current stack).

MenhirSdk & Cmly

With the --cmly flag, Menhir produces a parser.cmly file. It contains most intermediate structures, saved for further inspection. The MenhirSdk package allows loading these structures in an OCaml program to implement custom analyses, without having to touch Menhir’s implementation.
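
A tiny sketch of what consuming a .cmly file looks like (assuming the parser was compiled with --cmly and the menhirSdk library is linked; the analysis itself, printing terminal names, is just a placeholder):

module G = MenhirSdk.Cmly_read.Read (struct let filename = "parser.cmly" end)

let () =
  (* Walk the grammar's terminals; real analyses would inspect productions,
     LR(1) states and transitions through the same module. *)
  G.Terminal.iter (fun t -> print_endline (G.Terminal.name t))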

Use-cases

I used the features above to solve many problems that commonly arise when parsing and for which a traditional LR parser falls short.

Error recovery in Merlin

Merlin has to produce an AST even in the presence of errors (most of the time, some form of partial input).

By combining a static analysis using --cmly (finding sequences of actions that allow exiting from all syntactic constructs) and a dynamic one using --inspection (a flavour of Dijkstra’s shortest path that applies actions on the parsing stack), Merlin can always produce an AST.

The prefix property guarantees that no information from the valid prefix is lost (this is relevant while typing because very often, the code before the cursor is correct).

The static analysis guarantees that the recovery is complete: it is always possible to recover (and it points out problematic constructions if not).

The dynamic one generates short completions and tries to balance finishing existing syntactic constructions against consuming more input.

And since the analysis is mostly grammar agnostic, we have been able to use it for many OCaml and Reason versions without significant maintenance.

Parsing beyond LR

Sometimes one needs to parse languages that are not LR. Unfortunately, Menhir does not implement the GLR algorithm, so there is no support for these languages out-of-the-box.

But the parsing checkpoints let you implement backtracking manually. Yann Régis-Gianas describes such a use in Morbig. I also used that for parsing the Reason and Hack languages.

It also makes it simple to express some “meta-syntactic” rules, such as “insert ; whenever this is not ambiguous and would permit continuing the parse” (Reason does that).
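
Because checkpoints are immutable values, a driver can keep several of them alive at once and backtrack simply by returning to an older one. The following is only a sketch of that idea, not code from Morbig or the Hack parser; the Parser module and its token type are assumed to come from a --table build:

module I = Parser.MenhirInterpreter

(* Drive one checkpoint until it needs input again, is accepted, or rejected. *)
let rec settle checkpoint =
  match checkpoint with
  | I.Shifting _ | I.AboutToReduce _ | I.HandlingError _ ->
      settle (I.resume checkpoint)
  | other -> other

(* Offer [token] to every live parser and keep only those that survive.
   Forking extra checkpoints here (e.g. with an inserted SEMI token) is how
   non-LR alternatives can be explored in parallel. *)
let step live (token, startp, endp) =
  List.filter_map
    (fun checkpoint ->
       match checkpoint with
       | I.InputNeeded _ ->
           (match settle (I.offer checkpoint (token, startp, endp)) with
            | I.Rejected -> None
            | survivor -> Some survivor)
       | _ -> None)
    live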

Syntactic completion

The Syntactic completion PR extends Merlin completion to suggest grammatical constructions. The completions are derived automatically from the grammar, with some hints and heuristics to determine how they should be presented.

Sasano & Choi’s A text-based syntax completion method using LR parsing, presented at PEPM21, nicely introduces this concept for LR grammars (although our algorithm is different). You can also look at the related Spoofax Completion Framework (which is not based on LR).

Coming soon: declarative error message generation

I am currently working on a new framework for producing error messages using Menhir. It should come as an alternative to the existing error infrastructure, though we plan to find a way to benefit from both.

This is still a work in progress, but to give a taste of it, on

let x = 5 ;
let y = 6
let z = 7

our current prototype marks the ; as potentially problematic, because it prevents the let x = 5 from being parsed as a separate top-level definition.

Once again, this was developed outside of Menhir, by reusing the .cmly files and the checkpoints.

21 Likes

One of my favourite missing features from Menhir is the ability to specify abstract precedence levels with a partial order between them, as provided by Dypgen’s %relation keyword.

2 Likes

I came to like the incremental API, and the declarative error reporting you can use once you switch to it. However, merging and updating error description files is still a bit of a headache, even though it has improved, and I’m sure it will keep improving.

2 Likes

My favorite feature is naturally the conflict explanation, which only requires the --explain flag to trigger (and should be the default mode, really). Here is, for instance, one of the conflicts you get if you forget to set the priority of INT higher than function application in an ML-ish language.

** Conflict (shift/reduce) in state 19.
** Token involved: INT
** This state is reached from topexpr after reading:

FUN name RIGHTARROW simple_expr

** The derivations that appear below have the following common factor:
** (The question mark symbol (?) represents the spot where the derivations begin to differ.)

topexpr 
expr SEMISEMI 
(?)

** In state 19, looking ahead at INT, reducing production
** expr -> simple_expr
** is permitted because of the following sub-derivation:

simple_expr list_expr // lookahead token appears because list_expr can begin with INT
value // lookahead token is inherited
FUN name RIGHTARROW expr // lookahead token is inherited
                    simple_expr . 

** In state 19, looking ahead at INT, shifting is permitted
** because of the following sub-derivation:

simple_expr 
value 
FUN name RIGHTARROW expr 
                    simple_expr list_expr 
                                simple_expr 
                                value 
                                . INT 

With fairly limited know-how, you can tell that the parser doesn’t know what to do when confronted with the situation fun x -> <e> INT. Either it parses it like (fun x -> <e>) INT or like fun x -> (<e> INT). Priority naturally settles that question.
The explanations are both exhaustive and concrete, expressed in terms of parsing rules, which is highly effective when debugging a grammar.

2 Likes

I tend to prefer the explanations from Bison because they are less verbose and thus more readable. (Your mileage may vary.) But I agree with you that it is unfortunate that this is not the default behavior of this kind of tool.

a.y: warning: shift/reduce conflict on token NAME [-Wcounterexamples]
  Example: expr expr . NAME
  Shift derivation
    expr
    `-> expr expr
             `-> expr expr
                      `-> . NAME
  Reduce derivation
    expr
    `-> expr            expr
        `-> expr expr . `-> NAME
1 Like

Where is your parser for Hack using Menhir and this checkpoint-to-emulate-GLR thing?

You can find it here, freshly open-sourced thanks to help from @vsiles.
The parser is based on your original Pfff PHP parser; the lexing hacks proved very useful.
The “emulate GLR” stuff can be seen in the driver: it maintains a list of parsers, with some logic for backtracking and trying different parses for the non-LR features of Hack.

This parser is close to upstream Hack but not completely in sync. It was able to parse all Hack code I could find two years ago, but does not support newer features such as enum classes.

1 Like

I’m curious, what programming languages need GLR for parsing rather than LR(1)? Are there a lot? Few?

I am fascinated by tree-sitter. After humans describe a syntax in a JavaScript file, tree-sitter produces a language description in the form of a JSON file which can be machine-read to produce a parser.

This JSON could be translated to a Menhir grammar specification rather than go through tree-sitter’s own pipeline. Tree-sitter is famously incremental; Menhir is incremental too. But it seems that tree-sitter falls back on a GLR algorithm, which unfortunately we don’t have with Menhir.

See this:

conflicts - an array of arrays of rule names. Each inner array represents a set of rules that’s involved in an LR(1) conflict that is intended to exist in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If multiple parses end up succeeding, Tree-sitter will pick the subtree whose corresponding rule has the highest total dynamic precedence.

https://tree-sitter.github.io/tree-sitter/creating-parsers

So I guess my question is: would it be possible to produce a parsing specification in the Menhir DSL from tree-sitter’s JSON grammar description, instead of going through tree-sitter’s own pipeline? Would that give us some advantages? Would it be possible to deal with the GLR capability not being available?

Regarding the need for GLR, in my experience almost all modern languages need it. If you look at all the tree-sitter-xxx grammars on GitHub and look at the conflicts[] section, you’ll see they all use that capability (e.g., in Java to disambiguate A < B …, which can be a comparison or the start of the use of generics).

Regarding tree-sitter and this JS/JSON mess, I partially agree, which is why we did ocaml-tree-sitter-semgrep (GitHub - returntocorp/ocaml-tree-sitter-semgrep: Generate parsers from tree-sitter grammars extended to support Semgrep patterns).

Can the tree-sitter parser exposed in OCaml in Semgrep operate incrementally, potentially on every keystroke? Or is it only designed for batch parsing?

It is designed for batch parsing. Cc @mjambon, who is the tech lead of ocaml-tree-sitter-semgrep.

Slightly off topic, but w.r.t. running on every keystroke, you can get pretty fluent/efficient interfaces just using batch parsers and some careful use of timeouts. gopcaml-mode uses OCaml’s own parser to provide structural understanding of the OCaml AST to GNU Emacs, and as parsing takes just a few ms, you can get pretty far by waiting for a second of user idling and then running the parser; this results in a pretty seamless user experience.

1 Like