How to parse a token with a list argument in Menhir?

So I just added

%token <token list> DOCBLOCK

to my parser, and now I have no idea how to actually write a grammar for it. Would appreciate any link to example or relevant parts of documentation.

Especially confused by grammar rules which take arguments:

Error: does the symbol “docblock” expect 0 or 1 argument?

Couldn’t find any examples of this online.

That is a single token. The fact that its payload is a list of tokens is irrelevant, and Menhir/yacc will ignore it: to them, it’s just some type.

To clarify some more,

%token <token list> DOCBLOCK

ends up generating code of the form:

type token = ...
  | DOCBLOCK of token list

i.e. the token itself can carry some additional information, usually filled in by the lexer; however, this data is opaque to the parser.
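To make the "opaque payload" point concrete, here is a minimal sketch in plain OCaml. The type below is hand-written to mimic what Menhir would generate, not actual Menhir output; INT and AND are hypothetical extra tokens, and describe is a made-up helper.

```ocaml
(* Hand-written stand-in for the token type Menhir would generate.
   INT and AND are hypothetical tokens added for illustration. *)
type token =
  | INT of int
  | AND
  | DOCBLOCK of token list

(* The parser only dispatches on the constructor. It is user code, such as
   a semantic action, that looks inside the payload. *)
let rec describe (tok : token) : string =
  match tok with
  | INT i -> "INT " ^ string_of_int i
  | AND -> "AND"
  | DOCBLOCK inner ->
      "DOCBLOCK [" ^ String.concat "; " (List.map describe inner) ^ "]"

(* One single token, as far as the grammar is concerned. *)
let example = DOCBLOCK [ INT 1; AND; INT 2 ]
```

Here example is still a single terminal symbol for the grammar; the three tokens inside it are never seen by the parser.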

Typically, arguments to tokens are used for things like identifiers or literals - here’s an example from a .mly file I had lying around:

(* parser.mly *)
%token <int> INT
%token <float> FLOAT
%token <bool> BOOL  
%token AND
...

expr:
  | b = BOOL  { Exp_bool (mkloc ~loc:$loc b) }
  | i = INT   { Exp_number (mkloc ~loc:$loc (Int i)) }
  | f = FLOAT { Exp_number (mkloc ~loc:$loc (Float f)) }
  | l1 = BOOL AND l2 = BOOL {  Exp_and (l1, l2) }
 ...
;;

with a lexer which would populate the arguments to the tokens as follows:

(* lexer.mll *)

let digit =  ['0' - '9']
let digit_char = ['0' - '9' '_']
let integral_number = digit digit_char*

let number = integral_number ('.' digit_char* )? (['e' 'E'] ['+' '-']? integral_number)?

rule token = parse
  | newline  { update_loc lexbuf 1 false 0; token lexbuf }
  | blank+   {token lexbuf}

  | "true" { BOOL true }
  | "false" { BOOL false }
  | number {
      match int_of_string_opt (Lexing.lexeme lexbuf) with
      | None -> FLOAT (float_of_string (Lexing.lexeme lexbuf))
      | Some i -> INT i
    }
  | "&&" { AND }

and AST:

type ast =
| Exp_bool of bool
| Exp_int of int
| Exp_float of float
| Exp_and of bool * bool

Thanks for the elaboration!

Hmmm OK. Can I fix it by sending in a second parser to the main parser? Or flatten the list somehow? @Chet_Murthy also recommended having a stateful lexer with two rules, which I’ll try.

Menhir has some built-in grammar rules for lists, would it be better to have the lexer just generate the raw tokens (possibly the list separator tokens too) and let the grammar handle the construction of the list?
See separated_list(sep, element): you write the grammar for the separator (which could be just a single token) and for the element (again, this could be just a token such as DOCBLOCKELEMENT, declared as %token <mytoken> DOCBLOCKELEMENT), and Menhir will expand that into the proper grammar internally to match and construct a mytoken list.
If you have start/end markers for your list then see the delimited(start, middle, end) rule to help you write it more easily.

And you can combine them, I think something like this should work: delimited(LIST_START, separated_list(SEP, ELEMENT), LIST_END) where the capitals are all tokens produced by the lexer (but they could also be other grammar rules if you need something more complicated).

I’d love to, but I have no idea how to achieve that in the lexer, since it is a “sub-language”/DSL. Well, except for splitting the lexer into two states. Will try it later today.

Code: pholyglot/src/lib/lexer.mll at main · olleharstedt/pholyglot · GitHub

Ah I see, it is like a context-dependent grammar, and you have to switch lexers. So essentially the “sub-language” has to implement, inside the lexer itself, both a lexer and what you’d typically put in the “grammar”, leaving you with just the outer grammar to implement in Menhir.
In that case, can you match something like this in the grammar?

myrule:
| lst = DOCBLOCK  { DocBlock lst }
| ... (* more rules here *) ...

And have an OCaml type:

type t =
| DocBlock of innertoken list

given a %token <innertoken list> DOCBLOCK

If needed, that innertoken list could be fed to another Menhir parser, e.g. instead of DocBlock lst you’d write DocBlock (OtherParser.of_inner_tokens lst), where OtherParser would construct a lexing stream out of the list (no need to invoke a lexer) and call the other Menhir parser with it?
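A rough sketch of how feeding a token list to a second parser could look. Menhir's generated entry points take a lexer function of type Lexing.lexbuf -> token plus a lexbuf, so a closure that pops tokens off a list can play the role of the lexer. Everything else here is an assumption for illustration: the innertoken constructors, the EOF token, and parse_inner, which is a hand-written stand-in for a generated entry point such as OtherParser.main so that the example runs on its own.

```ocaml
(* Hypothetical inner token type; in a real project Menhir would generate it
   from the inner grammar's %token declarations. *)
type innertoken =
  | TAG of string
  | WORD of string
  | EOF

(* Turn a token list into a "lexer" with the shape Menhir entry points expect.
   The lexbuf argument is ignored; EOF is returned once the list runs out. *)
let lexer_of_list (tokens : innertoken list) : Lexing.lexbuf -> innertoken =
  let remaining = ref tokens in
  fun _lexbuf ->
    match !remaining with
    | [] -> EOF
    | tok :: rest ->
        remaining := rest;
        tok

(* Stand-in for a Menhir-generated entry point (assumption, not generated
   code): it collects (tag, word) pairs until EOF. *)
let parse_inner (next : Lexing.lexbuf -> innertoken) (lexbuf : Lexing.lexbuf)
    : (string * string) list =
  let rec loop acc =
    match next lexbuf with
    | EOF -> List.rev acc
    | TAG t -> (
        match next lexbuf with
        | WORD w -> loop ((t, w) :: acc)
        | _ -> failwith "expected a word after a tag")
    | WORD _ -> failwith "expected a tag"
  in
  loop []

let of_inner_tokens (tokens : innertoken list) : (string * string) list =
  (* A dummy lexbuf: source positions are meaningless unless set by hand. *)
  let dummy = Lexing.from_string "" in
  parse_inner (lexer_of_list tokens) dummy

let parsed =
  of_inner_tokens [ TAG "param"; WORD "int"; TAG "return"; WORD "void" ]
```

One caveat with this approach: since the inner parser never touches real input, any location information ($loc and friends) would have to be carried in the token payloads or written into the dummy lexbuf manually.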


FWIW I wrote something similar recently (not finished yet), see this which parses a Makefile using 2 syntaxes: one for makefile rules and one for the shell syntax: xen/dune2makefile.ml at xen-builds5 · edwintorok/xen · GitHub

In this case I didn’t use menhir at all, I just combined the 2 lexers by hand (it was simpler that way, but should be possible to achieve the same with Menhir)
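For what it's worth, the "two lexers combined by hand" idea can be sketched in a few lines of plain OCaml. This is not the code from either linked repository; the /** ... */ delimiters, the @tag convention, and all type and function names are assumptions, and the input is naively pre-split on whitespace to keep the example short.

```ocaml
(* Hypothetical token types: inner tokens live only inside a docblock. *)
type inner = Tag of string | Word of string
type token = IDENT of string | DOCBLOCK of inner list | EOF

let is_blank c = c = ' ' || c = '\n' || c = '\t'

(* Naive pre-pass: split the input on blanks, dropping empty pieces. *)
let words (s : string) : string list =
  String.map (fun c -> if is_blank c then ' ' else c) s
  |> String.split_on_char ' '
  |> List.filter (fun w -> w <> "")

(* Main mode: ordinary tokens, switching modes when a docblock opens. *)
let rec main_mode (ws : string list) : token list =
  match ws with
  | [] -> [ EOF ]
  | "/**" :: rest -> docblock_mode [] rest
  | w :: rest -> IDENT w :: main_mode rest

(* Docblock mode: accumulate inner tokens, then emit ONE DOCBLOCK token
   and hand control back to the main mode. *)
and docblock_mode (acc : inner list) (ws : string list) : token list =
  match ws with
  | [] -> failwith "unterminated docblock"
  | "*/" :: rest -> DOCBLOCK (List.rev acc) :: main_mode rest
  | w :: rest ->
      let tok =
        if String.length w > 1 && w.[0] = '@'
        then Tag (String.sub w 1 (String.length w - 1))
        else Word w
      in
      docblock_mode (tok :: acc) rest

let tokenize (s : string) : token list = main_mode (words s)
```

With ocamllex the same shape would be two entry points (rule ... and ...) that call each other, with the docblock rule returning the accumulated list as the payload of a single DOCBLOCK token.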
