Need help with a menhir grammar

Hi !

I’m stuck with conflict problem in menhir that I do not know how to solve.

I want to declare an ending sequence after a initial sequence:

%token B
%token W
%token EOL

%start<unit> main

%% 

seq    :                {}
       | seq B          {}
       | seq W          {}
       ;

ending :                {}
       | W W W          {}
       ;

main   : seq ending EOL {}

This grammar accept a sequence like B W B W W W EOL, but fail with B W W B EOL

$ echo "B W W B EOL" | menhir --interpret --interpret-show-cst bw.mly 
Warning: one state has shift/reduce conflicts.
Warning: one shift/reduce conflict was arbitrarily resolved.
Ready!
REJECT

I’ve seen the conflict pointed by the parser, and I understand it, but I do not know how to deal with the priority operators in order to solve it.

Does anybody know a way to make the grammar accept an ending sequence and give the it the priority over the initial sequence ?

Regards

Why should “B W W B” not be accepted? “ending” can parse the empty string, and seq is effectively (“B” | “W”)* isn’t it?

Can you specify more precisely what language you want to parse? [I recognize the contradiction in asking for “more precisely” when, in our field, the way you do that, is to write a grammar *grin*]

The language you’re trying to accept is:

(B|W)*(WWW)?<EOL>

yes?

Thanks for taking a look to my problem!

Why should “B W W B” not be accepted?

This what I do not fully understand. This shall be accepted on the paper, but menhir reject this sequence.

Can you specify more precisely what language you want to parse?

I have reduce the problem to a minimal (non-)working example, but my intention is to parse the conjugation of French verbs. To keep it simple, in English with the verb “to walk” you have the stem “walk” and thoses suffixes:

  • walk-{}
  • walk-ed
  • walk-ing
  • walk-s

In French you have much more declinations, and I’m falling in an error like the one presented in the first grammar:

  • err-e
  • err-er
  • err-erez
  • err-ez

The menhir documentation explain how to deal with conflicts, and the use of priority operators in order to tell the parser which rule to apply, but I didn’t find a way to make it accept my grammar with only two tokens!

It seems like the grammar has an inherent ambiguity no?

Should B W W W be parsed as

  • [seq: [seq: [seq: [seq: [seq:] B] W] W] W]
  • [seq: [seq:] B] [ending: W W W]

OK, now I get your problem. I thought you were asking why the -grammar- should not accept BWWB, not why the menhir automaton did not accept it.

Sure, your grammar is ambiguous (as @Gopiandcode points out). So let’s step away from it: what language are you trying to parse?

In the examples you give, are what are the tokens? Are they “walk”, “ed”, “ing”, “err”, “e” etc? Or are they characters? There’s a well-understood problem that there can be nondeterminism at the level of characters, which can be handled in a lexer, leaving determinism at the level of parsing. So a better explanation of the language you’re trying to parse would be useful.

But backing up, if you’re actually trying to parse actual natural language, I suspect you probably want to back up and look into the literature. using deterministic parsers for this stuff failed decades ago, and an entire field of probabilistic parsing grew up in its stead. [I’m no specialist, so it’s possible that probabilistic parsing got supplanted by something else, too] Also, even if you don’t go look at that stuff, you probably will want to at least look into Earley or CYK parsing: so you can write arbitrary CFGs.

Thanks for the clarification. I admit that I embarked on this project without really taking the time to read up on the existing points beforehand, seeing it as both a way to learn OCaml grammars and to explore the field of linguistics.

I’ll take a look at it all with a cool head.