Help wanted understanding Menhir parser

I am confused about the behavior of my Menhir parser. I isolated the confusing behavior to a minimal grammar:

%{
%}

%token DOT
%token <string> IDENT

%start<string list * string> path
%start<string> good

%%

path: list(IDENT DOT { $1 }) IDENT { $1, $2 }
good: IDENT { $1 }

I expect path to accept strings like IDENT, IDENT DOT IDENT, IDENT DOT IDENT DOT IDENT, etc. However, when running the Menhir interpreter, I get overshoot errors (and in my bigger project, the parse fails where I expect it to succeed due to the productions above).

Menhir gives a warning:

File "/home/REDACTED/.opam/4.07.0/lib/menhir/standard.mly", line 197, characters 16-16:
Warning: production list(__anonymous_0) -> is never reduced.
Warning: in total, 1 production is never reduced.

I don’t understand why. I computed the LR(1) state machine by hand:

GRAMMAR

S': path $

generated: IDENT DOT

list_generated:
  |
  | generated list_generated

path: list_generated IDENT

ITEM SETS

s0:
  S': . path $                                 { } SHIFT s1
  + path: . list_generated IDENT               { $ } GOTO s2
  + list_generated: .                          { IDENT } REDUCE s0
  + list_generated: . generated list_generated { IDENT } GOTO s3
  + generated: . IDENT DOT                     { IDENT } SHIFT s4

s1:
  S': path . $ { } SHIFT s5

s2:
  path: list_generated . IDENT { $ } SHIFT s6

s3:
  list_generated: generated . list_generated   { IDENT } GOTO s7
  + list_generated: .                          { IDENT } REDUCE s0
  + list_generated: . generated list_generated { IDENT } GOTO s3
  + generated: . IDENT DOT                     { IDENT } SHIFT s9

s4:
  generated: IDENT . DOT { IDENT } SHIFT s8

s5:
  S': path $ . { } ACCEPT

s6:
  path: list_generated IDENT . { $ } REDUCE s0

s7:
  list_generated: generated list_generated . { IDENT } REDUCE s0

s8:
  generated: IDENT DOT . { IDENT } REDUCE s0

s9:
  generated: IDENT . DOT { IDENT } SHIFT s10

s10:
  generated: IDENT DOT . { IDENT } REDUCE s3

What am I doing wrong? Did I make a mistake in my grammar that I’m stupidly overlooking? Does Menhir have different semantics than traditional LR(1)?

EDIT: Perhaps it’s due to a shift/reduce conflict with IDENT? I just realized that this might be it, so I’m going to investigate more.

EDIT 2: The left-recursive

a:
  | IDENT { $1 }
  | a DOT IDENT { $1 }

also fails. Unfortunately, I don’t have the time to make state machines by hand for everything.

Left-recursive versus right-recursive lists in LR parsers discusses this problem specifically for Menhir. Your left recursive definition in edit 2 looks correct to me (expecept not capturing the elements). I haven’t used Menhir but it looks like it has a standard library that includes a definition for these kinds of lists that you can use. I suspect your problem is in other parts of the grammar.

Below is simple grammar path.mly for OCamlYacc (that is also accepted by Menhir):

%token DOT
%token EOF
%token <string> IDENT

%start path
%type <string list> path

%%

path  : path_ EOF         { List.rev $1 }
      ;

path_ : IDENT             { [$1] }
      | path_ DOT IDENT   { $3 :: $1 }
      ;

The scanner scan.mll:

{
  let get      = Lexing.lexeme

  exception Error of string
  let error fmt = Printf.kprintf (fun msg -> raise (Error msg)) fmt
}

let alpha = ['a'-'z' 'A'-'Z']
let digit = ['0'-'9']
let word  = (alpha|digit)+
let ws    = [' ' '\t' '\n']


rule token = parse
  | word  { IDENT(get lexbuf) }
  | '.'   { DOT }
  | ws+   { token lexbuf }
  | eof   { EOF }
  | _     { error "illegal character" }

utop # #use "path.ml";;
utop # #use "scan.ml";;
utop # path token (Lexing.from_string "a1. b2 . c3");;
- : string list = ["a1"; "b2"; "c3"]

One of the problem I see with the grammar you gave is that there is no way to recognize when a list is finished.

If I understand correctly, the parse tries and parse a list of idents separated by dots. However, suppose you have read: “IDENT DOT IDENT” : should the parser wait for a new token to continue parsing ? or return the 2-element list ? What happens if afterwards the parser receives the tokens “DOT IDENT” ? Then certainly returning the 2-element list was wrong: thus the production never reduces because there might always be a next element in the list (hence the error message).

Basically, I think you lack an end of stream/file token at the end of your pathproduction.

2 Likes

See the 6.4 End of stream conflict section of the menhir manual, which explains why menhir behaves this way.

1 Like

Yes, one should always have an explicit “end of stream” token (lets call it ξ) generated by one’s lexical analyzer, and one’s initial production S can always be replaced by a new production S' → S ξ which is made the new initial production. That is generally necessary in most parsing systems.

@zozozo @perry When I run the Menhir interpreter, I’ve found that the end of input token isn’t necessary when the start nonterminal is otherwise problem-free. My understanding is that according to the LR algorithm, the grammar is “augmented” by a special start nonterminal, so the step would be done by the parser generator, although I don’t know about Menhir specifically.

Solution: The way that my data types work, a path consists of a prefix, AKA a list of strings, and a final name, a single string, to the right (string list * string). Thus, I need access to the end of the path while still having the prefix with the leftmost name at the front of the list. Therefore, solutions such as @lindig’s, although good, are inconvenient in the context of working with my specific data types. I fixed the problem by separating identifiers into capitalized and lowercase ones, and doing

path:
  | list(UIDENT DOT { $1 }) LIDENT { $1, $2 }

The limitation is that module names must be capitalized whereas type and value names must start with a lowercase letter (the grammar is for a programming language). As the MLs have this syntax design, I’m willing to settle for the identifier distinction.

If I run menhir --explain test.mly from your original grammar, I get a test.conflicts file explaining the conflicts. There are two conflicts that are similar, the simplest one in the file is explained as follows:

** The derivations that appear below have the following common factor:
** (The question mark symbol (?) represents the spot where the derivations begin to differ.)

path 
(?)

** In state 3, looking ahead at IDENT, reducing production
** list(__anonymous_0) -> 
** is permitted because of the following sub-derivation:

list(__anonymous_0) IDENT // lookahead token appears
. 

** In state 3, looking ahead at IDENT, shifting is permitted
** because of the following sub-derivation:

list(__anonymous_0) IDENT 
. IDENT DOT list(__anonymous_0) 

Let me paraphrase what it says.

  1. Imagine I am trying to parse path.

  2. and the next/lookahead token is IDENT

  3. I am allowed to use the rule “empty list” for LIST(IDENT DOT), so IDENT can be the last IDENT of the path

  4. But I am also allowed to use the rule “non-empty list” for LIST(IDENT DOT), and then IDENT I see is the first identifier of the list

As was previously mentioned, Menhir provides generic separated_list(sep, item) and separated_nonempty_rule(sep, item) rules that do not conflict, so it’s probably simplest to implement split_last : 'a list -> ('a list * 'a) option on your side.

2 Likes

As is explained in this section of the Menhir manual, Menhir and ocamlyacc do not augment the grammar with an end token on their own. That section will tell you quite a bit you will want to known. Indeed, Menhir has a wonderful manual which is very complete and very easy to read. I recommend it highly.

2 Likes

When working on my grammar, I realized that I wanted to have paths that ended in a capitalized identifier, in which my previous solution wouldn’t work.

My new solution was:

upath: UIDENT list(DOT UIDENT { $2 }) {
      match Base.List.rev $2 with
      | [] -> [], $1
      | last::rev_path -> $1::(Base.List.rev rev_path), last
    }

@gasche Thank you for explaining the shift/reduce conflict. I personally avoid separated_nonempty_list and nonempty_list because I want to encode in the type system that there is at least one element, and getting the last element of an arbitrary list is a partial operation.

@perry. Thank you for pointing me to that manual section. In the future I will read it carefully whenever I have an issue. You recommended that I define an EOF token. My understanding is that the EOF token is just a regular token from Menhir’s point of view? Thus, the problem in my original grammar would persist even when I used an explicit, separate start nonterminal with an EOF at the end of its production, the same way that the problem would persist when I use the productions in any larger grammar.

Edit: When I tried my original grammar with a new start symbol and an EOF token, the problem still exists, but with REJECT instead of OVERSHOOT errors, in the same manner that I got REJECT errors when using the problematic productions in my larger grammar. The problem is definitely inherent to the shift/reduce conflict and not solved by adding another nonterminal as a start symbol.