How to parse a token with a list argument in Menhir?

olleharstedt · October 1, 2022, 10:23pm

So I just added

%token <token list> DOCBLOCK

to my parser, and now I have no idea how to actually write a grammar for it. Would appreciate any link to example or relevant parts of documentation.

Especially confused by grammar rules which take arguments:

Error: does the symbol “docblock” expect 0 or 1 argument?

Couldn’t find any examples of this online.

Chet_Murthy · October 2, 2022, 12:02am

That is a single token. The fact that its payload is a list of tokens is irrelevant, and menhir/yacc will ignore it. To them, it’s just some type.

Gopiandcode · October 2, 2022, 8:29am

To clarify some more,

%token <token list> DOCBLOCK

ends up generating code of the form:

type token = ...
  | DOCBLOCK of token list

i.e. the token itself can carry some additional information, usually from the lexer; however, this data is opaque to the parser.
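To make "opaque to the parser" concrete, here is a minimal sketch in plain OCaml. Only DOCBLOCK comes from the thread; the NAME and STAR constructors and the describe function are made up for illustration. The point is that the payload is an ordinary OCaml value that you take apart in ordinary code, such as a semantic action, rather than in the grammar:

```ocaml
(* A cut-down stand-in for the generated token type. Only DOCBLOCK is
   from the thread; NAME and STAR are hypothetical. *)
type token =
  | NAME of string
  | STAR
  | DOCBLOCK of token list

(* The payload is inspected with ordinary pattern matching, just as a
   semantic action could do after binding it with [lst = DOCBLOCK]. *)
let rec describe = function
  | NAME s -> "NAME(" ^ s ^ ")"
  | STAR -> "STAR"
  | DOCBLOCK inner ->
    "DOCBLOCK[" ^ String.concat "; " (List.map describe inner) ^ "]"

let () = print_endline (describe (DOCBLOCK [ STAR; NAME "param" ]))
```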

Typically, arguments to tokens are used for things like identifiers or literals - here’s an example from a .mly file I had lying around:

(* parser.mly *)
%token <int> INT
%token <float> FLOAT
%token <bool> BOOL  
%token AND
...

expr:
  | b = BOOL  { Exp_bool (mkloc ~loc:$loc b) }
  | i = INT   { Exp_number (mkloc ~loc:$loc (Int i)) }
  | f = FLOAT { Exp_number (mkloc ~loc:$loc (Float f)) }
  | l1 = BOOL AND l2 = BOOL {  Exp_and (l1, l2) }
 ...
;;

with a lexer which would populate the arguments to the tokens as follows:

(* lexer.mll *)

let digit =  ['0' - '9']
let digit_char = ['0' - '9' '_']
let integral_number = digit digit_char*

let number = integral_number ('.' digit_char* )? (['e' 'E'] ['+' '-']? integral_number)?

rule token = parse
  | newline  { update_loc lexbuf 1 false 0; token lexbuf }
  | blank+   {token lexbuf}

  | "true" { BOOL true }
  | "false" { BOOL false }
  | number {
      match int_of_string_opt (Lexing.lexeme lexbuf) with
      | None -> FLOAT (float_of_string (Lexing.lexeme lexbuf))
      | Some i -> INT i
    }
  | "&&" { AND }

and AST:

type ast =
| Exp_bool of bool
| Exp_int of int
| Exp_float of float
| Exp_and of bool * bool

olleharstedt · October 2, 2022, 9:36am

Thanks for the elaboration!

Hmmm OK. Can I fix it by sending in a second parser to the main parser? Or flatten the list somehow? @Chet_Murthy also recommended having a stateful lexer with two rules, which I’ll try.

edwin · October 2, 2022, 10:29am

Menhir has some built-in grammar rules for lists. Would it be better to have the lexer just generate the raw tokens (possibly the list separator tokens too) and let the grammar handle the construction of the list?
See separated_list(sep, element), where you write the grammar for the separator (could be just a single token) and for the element (again, could be just something like DOCBLOCKELEMENT, given %token <mytoken> DOCBLOCKELEMENT). Menhir will expand that to the proper grammar internally to match and construct a mytoken list.
If you have start/end markers for your list then see the delimited(start, middle, end) rule to help you write it more easily.

And you can combine them, I think something like this should work: delimited(LIST_START, separated_list(SEP, ELEMENT), LIST_END) where the capitals are all tokens produced by the lexer (but they could also be other grammar rules if you need something more complicated).
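To show what that combined rule is meant to accept, here is a hand-written sketch in plain OCaml that recognizes the same token sequences from a list. It is not how Menhir implements these rules (Menhir expands them into ordinary grammar productions); the token type and function names are assumptions made for this example:

```ocaml
(* Hypothetical token type mirroring the names used above. *)
type token = LIST_START | LIST_END | SEP | ELEMENT of string

(* separated_list(SEP, ELEMENT): a possibly empty run of elements with
   SEP between consecutive ones. Returns the collected payloads and the
   remaining input. *)
let separated_list tokens =
  let rec more acc = function
    | SEP :: ELEMENT e :: rest -> more (e :: acc) rest
    | rest -> (List.rev acc, rest)
  in
  match tokens with
  | ELEMENT e :: rest -> more [ e ] rest
  | rest -> ([], rest)

(* delimited(LIST_START, separated_list(SEP, ELEMENT), LIST_END):
   the list must be wrapped in the two marker tokens. *)
let delimited_list = function
  | LIST_START :: rest ->
    (match separated_list rest with
     | elements, LIST_END :: rest' -> Some (elements, rest')
     | _ -> None)
  | _ -> None
```

With this sketch, a trailing separator right before LIST_END is rejected and an empty list between the markers is accepted.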

olleharstedt · October 2, 2022, 10:33am

I’d love to, but I have no idea how to achieve that in the lexer, since it is a “sub-language”/DSL. Well, except for splitting the lexer into two states. Will try it later today.

Code: pholyglot/src/lib/lexer.mll at main · olleharstedt/pholyglot · GitHub
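For readers following along, the two-state idea can be sketched in plain OCaml without ocamllex. This is only an illustration under simplifying assumptions, not the lexer linked above: the input is split on spaces, the WORD constructor and the function names are invented, and the docblock markers are treated as standalone words. The outer rule hands over to the inner rule at the opening marker, and the inner rule returns its collected tokens as the payload of a single DOCBLOCK token:

```ocaml
(* Hypothetical token type; only DOCBLOCK comes from the thread. *)
type token = WORD of string | DOCBLOCK of token list

(* Outer rule: plain words, until the opening marker hands control to
   the inner rule. *)
let rec outer acc = function
  | [] -> List.rev acc
  | "/**" :: rest ->
    let inner_tokens, rest' = inner [] rest in
    outer (DOCBLOCK inner_tokens :: acc) rest'
  | w :: rest -> outer (WORD w :: acc) rest

(* Inner rule: collects tokens until the closing marker. In this
   sketch an unterminated block simply ends at end of input. *)
and inner acc = function
  | [] -> (List.rev acc, [])
  | "*/" :: rest -> (List.rev acc, rest)
  | "*" :: rest -> inner acc rest (* skip a decorative star *)
  | w :: rest -> inner (WORD w :: acc) rest

(* Assumption: tokens are separated by single or repeated spaces. *)
let tokenize s =
  String.split_on_char ' ' s
  |> List.filter (fun w -> w <> "")
  |> outer []
```

In ocamllex the same shape would be two mutually recursive rules, with the inner one taking an accumulator argument.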

edwin · October 2, 2022, 10:50am

Ah I see, it is like a context-dependent grammar, and you have to switch lexers. So essentially the “sub-language” needs to implement, in the lexer itself, both a lexer and what you’d typically implement in the grammar, leaving you with just the outer grammar to implement in Menhir.
In that case, can you match something like this in the grammar?

myrule:
| lst = DOCBLOCK  { DocBlock lst }
| ... (* more rules here *) ...

And have an OCaml type:

type t =
| DocBlock of innertoken list

given a %token <innertoken list> DOCBLOCK

If needed, that innertoken list could perhaps be fed to another Menhir parser, e.g. instead of DocBlock lst you’d do DocBlock (OtherParser.of_inner_tokens lst), where OtherParser would construct a token stream out of the list (no need to invoke a lexer) and call the other Menhir parser with it.
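A minimal sketch of how such an of_inner_tokens could look, assuming the inner parser's entry point has the usual shape of a Menhir-generated function, (Lexing.lexbuf -> token) -> Lexing.lexbuf -> result. The innertoken constructors, lexer_of_list, and fake_main are all hypothetical; fake_main only stands in for the generated entry point so that the example runs without Menhir:

```ocaml
(* Hypothetical inner token type; EOF marks the end of the stream. *)
type innertoken = TAG of string | WORD of string | EOF

(* Turn a token list into a function usable as the lexer argument of
   a parser entry point. The lexbuf is ignored; once the list is
   exhausted, EOF is returned on every further call. *)
let lexer_of_list (tokens : innertoken list) : Lexing.lexbuf -> innertoken =
  let remaining = ref tokens in
  fun _lexbuf ->
    match !remaining with
    | [] -> EOF
    | tok :: rest ->
      remaining := rest;
      tok

(* Stand-in for a generated entry point. It just collects tokens
   until EOF; a real parser would build an AST here. *)
let fake_main (lexer : Lexing.lexbuf -> innertoken) (lexbuf : Lexing.lexbuf) =
  let rec loop acc =
    match lexer lexbuf with
    | EOF -> List.rev acc
    | tok -> loop (tok :: acc)
  in
  loop []

let of_inner_tokens tokens =
  (* A dummy lexbuf: nothing is ever read from it. *)
  fake_main (lexer_of_list tokens) (Lexing.from_string "")
```

One caveat with this approach: since the dummy lexbuf never advances, any source positions the inner parser reports would not be meaningful unless the positions are carried along with the tokens and restored by hand.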

edwin · October 2, 2022, 10:56am

FWIW I wrote something similar recently (not finished yet), which parses a Makefile using 2 syntaxes, one for makefile rules and one for the shell syntax: xen/dune2makefile.ml at xen-builds5 · edwintorok/xen · GitHub

In this case I didn’t use menhir at all, I just combined the 2 lexers by hand (it was simpler that way, but should be possible to achieve the same with Menhir)
