Lookahead in menhir parser

jonathandoyle · February 10, 2019, 4:16pm

I would like to use “lookaheads” in my menhir parser, i.e. situations where
a reduction depends on a few tokens following the pattern, without those tokens getting absorbed in the reduction itself. An example is when you parse OCaml Code, you wait for an end-of-file or the start of a new definition to know when a value definition is finished.

I invented a very simple minimal example of that : a grammar with only three tokens A,B,C ; where there are only two accepted sequences of tokens, AB and ABC, and the A and the B are grouped differently according to whether there the sequence ends with a C or not.

Below is my failed attempt. What is the correct way to do this (if there is one) ?

Contents of my mly file :

%token <string> A
%token <string> B 
%token <string> C 

%{

  type result=
     AB_ALONE of string 
    |AB_BEFORE_C of string
    |C_ALONE of string

%}

%start <result list> main


%%

main: 
| str1=a_term str2=b_term str3=c_term { [AB_BEFORE_C(str1^str2);C_ALONE(str3)] }
| str1=a_term str2=b_term { [AB_ALONE(str1^str2)] }

a_term:
| str1=A {str1}
b_term:
| str1=B {str1}
c_term:
| str1=C {str1}


The output :

    $ menhir --interpret --interpret-show-cst my_parser.mly
    Warning: 2 states have an end-of-stream conflict.
    File "my_parser.mly", line 21, characters 2-25:
    Warning: production main -> a_term b_term is never reduced.
    Warning: in total, 1 production is never reduced.
    A B C
    ACCEPT
    [main: [a_term: A] [b_term: B] [c_term: C]]
    A B 
    OVERSHOOT

grayswandyr · February 11, 2019, 12:28pm

I don’t know if this is the correct way to do it, but one solution is to use a so-called “lexer hack”. That is, between lexing and parsing, you add an intermediate step that analyses tokens based on tokens you read just before, and then emits different tokens for different situations. It often implies analysing your tokens to know what other tokens may follow (e.g. using so-called FIRST or FOLLOW sets in parsing theory).

jonathandoyle · February 11, 2019, 1:04pm

Thank for your reply. I must admit though that I only very vaguely understand how it could be implemented. Could you illustrate your idea with some code on my (very minimal) example ?

keleshev · February 11, 2019, 1:14pm

I think it would help if you also mention some more details about what you’re parsing.

I can give one example of a lexer hack that I made in order to do semicolon insertion like it is done in Go. My lexer keeps track of the last token using a ref and when it reaches a newline it looks if the previous token is one of those which allow semicolon insertion, and if so, issues a SEMI token.

github.com

keleshev/quark.ml/blob/master/Lexer.mll#L79


                 | '0' (       (dec_digit | '_')* float_suffix?
                       | 'b'   ('1' | '0' | '_')+
                       | 'o'   (oct_digit | '_')+
                       | 'x'   (hex_digit | '_')+
                       )


rule read = parse
| [' ' '\t' '\r']+ { read lexbuf }
| '\n' {
    Lexing.new_line lexbuf;
    if is_semicolon_required !previous_token then
      return SEMI
    else
      read lexbuf
  }
| "{"  { return LBR     }
| "}"  { return RBR     }
| "["  { return LBK     }
| "]"  { return RBK     }
| "("  { return LPR     }
| ")"  { return RPR     }

Then a parser parses it as if the semicolons were required.

jonathandoyle · February 11, 2019, 1:51pm

I already mentioned the case of value definitions in OCaml. In that case, I need to “insert double semicolons if needed” much the way you “insert semicolons if needed”.

I’m rather surprised though that you can do the hacking in the lexer directly. I’d have thought that you needed to do a large amount of parsing (not just lexing) before being able to know if a semicolon needs to be inserted. I’ll take a look at your code …

grayswandyr · February 11, 2019, 2:22pm

The lexer hack is applied before parsing.

A simple way to do it is for your lexer to emit the list of recognized tokens.

Then you write a function hack that will be called by the parser: will yield a token (for the parser) every time it will be called.

So the smart (and error-prone) stuff happens in hack: hack reads the list of tokens emitted by the lexer and, depending on your specific rules, emits a new token.

For instance, if you know that in every sequence of tokens A B C, A should be understood as having a certain meaning but otherwise it has another, second meaning, then you emit A1 or A2 (and at later calls by the parser: B and C) rather than A. And your parser only knows tokens A1 and A2, not A (provided you don’t use A elsewhere of course)

Menhir has functions to control the parsing process yourself. You can see old (badly written!) examples here: lexer hack, parsing.

keleshev · February 11, 2019, 4:20pm

I’ve looked at the OCaml parsing code to try to understand how the optional ;; are handled. I don’t think I have a complete picture, but here’s the simplified grammar:

structure:
  | expression? structure_element*

structure_element:
  | SEMISEMI expression?
  | structure_item

So an expression can be followed by structure items without ;;, but to add another expression you need to add ;;. Also, any additional ;; are allowed, since the expression after SEMISEMI is optional: expression?.

I’m sure there are some precedence annotation involved to make all of this work together.

github.com

ocaml/ocaml/blob/3a2ae8c462744951fe5f25ae46be3ee2fe646182/parsing/parser.mly#L1239


%inline str_exp:
e = seq_expr
attrs = post_item_attributes
  { mkstrexp e attrs }
;


(* A structure element is one of the following:
 - a double semicolon followed with an optional standalone expression;
 - a structure item. *)
%inline structure_element:
  append(text_str_SEMISEMI, optional_structure_standalone_expression)
| text_str(structure_item)
    { $1 }
;


(* A structure item. *)
structure_item:
  let_bindings(ext)
    { val_of_let_bindings ~loc:$sloc $1 }
| mkstr(
    item_extension post_item_attributes

bentnib · February 12, 2019, 8:59am

This example doesn’t need a lexer that is modified during parsing or anything similar to work as suggested by the others. Your problem can be fixed by adding an extra token ‘EOF’ that indicates the end of the input:

main: 
| str1=a_term str2=b_term str3=c_term EOF { [AB_BEFORE_C(str1^str2);C_ALONE(str3)] }
| str1=a_term str2=b_term EOF { [AB_ALONE(str1^str2)] }

This grammar goes through menhir with no errors, and as long as your lexer produces the ‘EOF’ token at the end of the input, it should work as expected. See Section 6.4 “End of stream conflicts” in the Menhir manual.

grayswandyr · February 13, 2019, 10:55am

Right, I was only addressing how you may handle lookahead > 1.

Topic		Replies	Views
Can I stop lookahead in menhir for a rule? Learning	1	341	June 23, 2023
Having trouble with Menhir Learning	4	2073	August 25, 2018
Define Literals on Parser using ocamlyacc / menhir Learning	7	599	September 29, 2021
Help wanted understanding Menhir parser Learning	8	3356	September 16, 2018
Need help with a menhir grammar Learning menhir	6	1077	August 31, 2021

Lookahead in menhir parser

Related topics