Lookahead in menhir parser

I would like to use “lookaheads” in my menhir parser, i.e. situations where
a reduction depends on a few tokens following the pattern, without those tokens getting absorbed in the reduction itself. An example is when you parse OCaml Code, you wait for an end-of-file or the start of a new definition to know when a value definition is finished.

I invented a very simple minimal example of that : a grammar with only three tokens A,B,C ; where there are only two accepted sequences of tokens, AB and ABC, and the A and the B are grouped differently according to whether there the sequence ends with a C or not.

Below is my failed attempt. What is the correct way to do this (if there is one) ?

Contents of my mly file :

%token <string> A
%token <string> B 
%token <string> C 

%{

  type result=
     AB_ALONE of string 
    |AB_BEFORE_C of string
    |C_ALONE of string

%}

%start <result list> main


%%

main: 
| str1=a_term str2=b_term str3=c_term { [AB_BEFORE_C(str1^str2);C_ALONE(str3)] }
| str1=a_term str2=b_term { [AB_ALONE(str1^str2)] }

a_term:
| str1=A {str1}
b_term:
| str1=B {str1}
c_term:
| str1=C {str1}


The output :

    $ menhir --interpret --interpret-show-cst my_parser.mly
    Warning: 2 states have an end-of-stream conflict.
    File "my_parser.mly", line 21, characters 2-25:
    Warning: production main -> a_term b_term is never reduced.
    Warning: in total, 1 production is never reduced.
    A B C
    ACCEPT
    [main: [a_term: A] [b_term: B] [c_term: C]]
    A B 
    OVERSHOOT

I don’t know if this is the correct way to do it, but one solution is to use a so-called “lexer hack”. That is, between lexing and parsing, you add an intermediate step that analyses tokens based on tokens you read just before, and then emits different tokens for different situations. It often implies analysing your tokens to know what other tokens may follow (e.g. using so-called FIRST or FOLLOW sets in parsing theory).

1 Like

Thank for your reply. I must admit though that I only very vaguely understand how it could be implemented. Could you illustrate your idea with some code on my (very minimal) example ?

I think it would help if you also mention some more details about what you’re parsing.

I can give one example of a lexer hack that I made in order to do semicolon insertion like it is done in Go. My lexer keeps track of the last token using a ref and when it reaches a newline it looks if the previous token is one of those which allow semicolon insertion, and if so, issues a SEMI token.

Then a parser parses it as if the semicolons were required.

1 Like

I already mentioned the case of value definitions in OCaml. In that case, I need to “insert double semicolons if needed” much the way you “insert semicolons if needed”.

I’m rather surprised though that you can do the hacking in the lexer directly. I’d have thought that you needed to do a large amount of parsing (not just lexing) before being able to know if a semicolon needs to be inserted. I’ll take a look at your code …

1 Like

The lexer hack is applied before parsing.

A simple way to do it is for your lexer to emit the list of recognized tokens.

Then you write a function hack that will be called by the parser: will yield a token (for the parser) every time it will be called.

So the smart (and error-prone) stuff happens in hack: hack reads the list of tokens emitted by the lexer and, depending on your specific rules, emits a new token.

For instance, if you know that in every sequence of tokens A B C, A should be understood as having a certain meaning but otherwise it has another, second meaning, then you emit A1 or A2 (and at later calls by the parser: B and C) rather than A. And your parser only knows tokens A1 and A2, not A (provided you don’t use A elsewhere of course)

Menhir has functions to control the parsing process yourself. You can see old (badly written!) examples here: lexer hack, parsing.

I’ve looked at the OCaml parsing code to try to understand how the optional ;; are handled. I don’t think I have a complete picture, but here’s the simplified grammar:

structure:
  | expression? structure_element*

structure_element:
  | SEMISEMI expression?
  | structure_item

So an expression can be followed by structure items without ;;, but to add another expression you need to add ;;. Also, any additional ;; are allowed, since the expression after SEMISEMI is optional: expression?.

I’m sure there are some precedence annotation involved to make all of this work together.

This example doesn’t need a lexer that is modified during parsing or anything similar to work as suggested by the others. Your problem can be fixed by adding an extra token ‘EOF’ that indicates the end of the input:

main: 
| str1=a_term str2=b_term str3=c_term EOF { [AB_BEFORE_C(str1^str2);C_ALONE(str3)] }
| str1=a_term str2=b_term EOF { [AB_ALONE(str1^str2)] }

This grammar goes through menhir with no errors, and as long as your lexer produces the ‘EOF’ token at the end of the input, it should work as expected. See Section 6.4 “End of stream conflicts” in the Menhir manual.

3 Likes

Right, I was only addressing how you may handle lookahead > 1.