[ANN] New release of Menhir (20211230)

Dear OCaml & Menhir users,

I am pleased to announce a new release of Menhir, with a major improvement.

The code back-end has been rewritten from the ground up by Émile Trotignon
and by myself, and now produces efficient and well-typed OCaml code. The
infamous Obj.magic is not used any more.

Furthermore, the new code back-end produces code that is more aggressively
optimized, leading to a significant reduction in memory allocation and a
typical performance improvement of up to 20% compared to the previous code
back-end.

  opam update
  opam install menhir.20211230

Happy well-typed parsing in 2022!

2021/12/30

  • The code back-end has been rewritten from the ground up by Émile Trotignon
    and François Pottier, and now produces efficient and well-typed OCaml
    code. The infamous Obj.magic is not used any more.

    The table back-end and the Coq back-end are unaffected by this change.

    The main side effects of this change are as follows:

    • The code back-end now needs type information. This means that either
      Menhir’s type inference mechanism must be enabled, or the type of
      every nonterminal symbol must be given explicitly via a %type
      declaration. The easiest way of enabling type inference is to use
      Menhir via dune and to check that the dune-project file says
      (using menhir 2.0) or later.

    • The code back-end no longer allows the type of any symbol to be an
      open polymorphic variant type, such as [> `A ]. As a workaround,
      we suggest using a closed polymorphic variant instead.

    • The code back-end now adheres to the simplified error-handling strategy,
      as opposed to the legacy strategy.

      For grammars that do not use the error token, this makes no difference.

      For grammars that use the error token in the limited way permitted by
      the simplified strategy, this makes no difference either. The simplified
      strategy makes the following requirement: the error token should always
      appear at the end of a production, whose semantic action should abort the
      parser by raising an exception.

      Grammars that make more complex use of the error token, and therefore
      need the legacy strategy, cannot be compiled by the new code back-end.
      As a workaround, it is possible to switch to the table back-end (using
      --table --strategy legacy) or to the ancient code back-end (using
      --code-ancient). In the long run, we recommend abandoning the use of
      the error token. Support for the error token may be removed entirely
      at some point in the future.

    The original code back-end, which has been around since the early days of
    Menhir (2005), temporarily remains available (using --code-ancient). It
    will be removed at some point in the future.

    The new code back-end offers several levels of optimization, which remain
    undocumented and are subject to change in the future. At present, the main
    levels are roughly as follows:

    • -O 0 --represent-everything uses a uniform representation of the stack
      and produces straightforward code.
    • -O 0 uses a non-uniform representation of the stack; some stack cells
      have fewer fields; some stack cells disappear altogether.
    • -O 1 reduces memory traffic by moving PUSH operations so that they
      meet POP operations and cancel out.
    • -O 2 optimizes the reduction of unit productions (that is, productions
      whose right-hand side has length 1) by performing a limited amount of
      code specialization.

    The default level of optimization is the maximum level, -O 2.

  • The new command line switch --exn-carries-state causes the exception
    Error to carry an integer parameter: exception Error of int. When the
    parser detects a syntax error, the number of the current state is reported
    in this way. This allows the caller to select a suitable syntax error
    message, along the lines described in Section 11 of the manual. This
    command line switch is currently supported by the code back-end only.
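
    As a rough sketch of how a caller might use the state number, the
    following OCaml fragment maps states to messages. Everything in it is
    an assumption made for illustration: the module Parser is a stand-in
    for a parser generated with --exn-carries-state, its entry point main
    always fails in state 5, and the state numbers and message texts are
    invented. In a real project, they would depend on the grammar's
    automaton.

    ```ocaml
    (* Stand-in for a generated parser; hypothetical, for illustration only. *)
    module Parser = struct
      exception Error of int

      (* A fake entry point that always reports a syntax error in state 5. *)
      let main () : int = raise (Error 5)
    end

    (* Hypothetical table of messages, indexed by state number. *)
    let message_of_state (state : int) : string =
      match state with
      | 0 -> "Expected an expression."
      | 5 -> "Expected an operand after the operator."
      | _ -> Printf.sprintf "Syntax error (state %d)." state

    (* Runs a parsing function and turns a syntax error into a message. *)
    let describe (parse : unit -> int) : string =
      match parse () with
      | v -> Printf.sprintf "parsed: %d" v
      | exception Parser.Error state -> message_of_state state

    let () = print_endline (describe Parser.main)
    ```

    The catch-all case keeps the caller usable when a state has no
    dedicated message.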

  • The $syntaxerror keyword is no longer supported.

  • Document the trick of wrapping module aliases in open struct ... end,
    like this: %{ open struct module M = MyLongModuleName end %}.
    This allows you to use the short name M in your grammar, but forces
    OCaml to infer types that refer to the long name MyLongModuleName.
    (Suggested by Frédéric Bour.)
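
    A minimal sketch of this trick in plain OCaml (4.08 or later, where
    open struct ... end is available). The contents of MyLongModuleName
    are hypothetical; in a grammar, the open struct ... end line would
    sit inside the %{ ... %} header.

    ```ocaml
    (* Hypothetical module with a long name, standing in for a real one. *)
    module MyLongModuleName = struct
      type t = { line : int; column : int }

      let make line column = { line; column }
    end

    (* The alias M can be used in the rest of this file, but it is not
       exported as part of the file's signature. *)
    open struct
      module M = MyLongModuleName
    end

    (* The short name is used here; the value has type MyLongModuleName.t. *)
    let origin : M.t = M.make 1 0
    ```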
