Recursion in Menhir lexer for small DSL

I need to lex and parse a small DSL inside my language, related to the docblock:

    /**
     * @param array<int> $ints
     */
    function foo(array &$ints): void {
    }

The DSL is the @param part, and the main language is function foo etc.

In lexer.mll I have a main lexer, and a sub-lexer for the docblock (always starting with /**). But I can’t figure out how the sub-lexer returns its own list of tokens. It returns one and then goes back to the main lexer again.

  | "/**"                         { docblock lexbuf }

and

and docblock = parse
  | "@param"            { DOCBLOCK_PARAM }

Source here: pholyglot/lexer.mll at main · olleharstedt/pholyglot · GitHub

Any tips how to achieve the recursive token stream here? Or link to relevant docs?

I took a quick look at your code, and … it wasn’t clear right away how your recursion works. So I’ll answer only more generally.

Lexer entrypoints are meant to only ever be invoked in tail-position, IIUC. When you call a lexer entrypoint in the action of another lexer pattern-branch, that’s “continuing the lex”. Eventually it’s all going to return a single token. If you want to accumulate some sort of list (and when I’ve done this, this is how I did it) you need to pass along an accumulating parameter to those recursive calls.

Alright, thanks for you answer. :slight_smile: You have a link to src or doc on how to do that…?

Perhaps you mean that you want to “switch lexers” – so you’ll have one lexer when you’re outside the docblock, and a different one inside it? This is a common pattern, and when I do it, I use a state-variable that tracks which lexer to call. So you’ll have lexer1 and lexer2, each of which take a parameter state. Then token will use that state to decide which of lexer1 and lexer2 to call.

Then you’ll have a function like:

let make_lexer () = let state = <expression to create state> in
fun lb -> if state.choose_lex1 then lexer1 state lb else lexer2 state lb

and then

let token = make_lexer () in
.... parse ....

Maybe this isn’t relevant at all, just guessing.

Hm, I think I use the wrong terminology. I think I just want another “rule” to parse the docblock DSL. But if two lexers are easier, sure. :slight_smile: Would still be helped by a link tho. :no_mouth:

I’m digging around; I’m sure I can find something. One more question:

do you want to parse the docblock within the lexer, or do you want to return tokens from the docblock and have some other parser construct the docblock from those tokens ?

I was thinking of using one parser only. I just want different rules to apply to the tokens inside the docblock. Less things should be allowed than in the main lexer rule. So it’s a token stream, but with slightly different rules than the main one.

Hm, so no, no parsing inside the lexer.

Here’s a really trivial example: qc-ocaml/qasm2_lexer.mll at e4c2876beeb7658fb3305449a1c5236aa11e6265 · chetmurthy/qc-ocaml · GitHub

I want to have one lexer for the first line, and then after that, a different lexer.

I’ll look around and see if I can find others that are more involved.

That’s cool, thanks!

I tried just adding a “result” param to the docblock rule to collect all tokens. BUT, now the main parser expects a single token, not a list. :angry:

Error: This expression has type token list but an expression was expected of type token

You can make that work, too. How about this: imagine a language with comments and integers. That’s it. Then you have two tokens:

INT of int
COMM of string list

Then you can write a rule like (I’m writing this in the edit window, not actually compiling, so expect errors):

rule token = parse
[0-9]+ { INT (int_of_string (lexeme lexbuf)) }
| "/*" { comment [lexeme lexbuf)] lexbuf }
| [' ' '\t' '\n']+ { token lexbuf }

and comment acc = parse
"*/" { COMM (List.rev ((lexeme lexbuf)::acc))) }
| [^ '*']+ { comment ((lexeme lexbuf)::acc) lexbuf }

I think that does it. Obviously, this is a really simple case where “parsing” the comment means just accumulating the blocks until you get an end-comment string. If you were doing something more complicated (like OCaml’s “(* … )" where the nested comments are allowed, and the "(” and “*)” must match, nesting-wise) you might want to let the parser do that. Or not – you could write your little bit of recursive-descent parser right there in the lexer. Jeremy Yallop’s Yojson “parser” is done that way (I think, for efficiency).

1 Like

Interesting. I’m trying with %token <token list> DOCBLOCK right now. Thanks for all the quick and knowledgeable feedback!! :heart:

Edit, yes, works now! :partying_face: Got debug output

START_SCRIPT DOCBLOCK (DOCBLOCK_PARAM ARRAY_TYPE LT INT_TYPE GT NAME $ints) 
FUNCTION NAME foo LPAREN ARRAY_TYPE AMPERSAND DOLLAR 
NAME ints RPAREN COLON VOID_TYPE LBRACE RBRACE 

for the snippet in OP.

1 Like

OK, I have actually no idea how to parse

`%token <token list> DOCBLOCK`

but I’ll look around. :sweat_smile:

I’m not sure, but I think it just means the generated token-type has a branch

| DOCBLOCK of token list

and nothing more.

Yeah, I was more thinking of how to design the grammar rules around it. Kinda stuck. :stuck_out_tongue: Would be nice to be able to debug the grammar, too, somehow.

wait, what do you mean by “design the grammar rules around it” ? Do you need to parse that DOCBLOCK ? b/c if so, then you might want to switch to the style of two rules with a state-variable selecting which gets called.

Right yes, I need to parse it. Maybe I misunderstood you before.

I’ll give it a go!

I wrote here an example for two very simple nested grammars and lexers: GitHub - thierry-martinez/stacked_parsers: Example of stacked parsers for https://discuss.ocaml.org/t/recursion-in-menhir-lexer-for-small-dsl/10566

The two grammars are:

s ::= 0..9+ | '(' s ')' | '{' s2 '}'
s2 ::= [A..Z a..z]+ | '(' s ')' | '{' s2 '}'

Note that the two grammars share some lexems but not all of them (the lexer of s recognizes integers, the lexer of s2 recognizes words on the alphabet az), and the two grammars can be arbitrary nested.

I give two possible implementations: the first variant uses a stack of lexers, in the vein of what @Chet_Murthy describes. The second variant makes the lexer call the parser to parse nested grammars and stores the parsing result in a single token for the parent parser: in other words, the lexer uses the call stack instead of an explicit stack. I tend to think that this second variant is quite clean(er?) but I should admit I have never seen uses of it in the past (but I have seen quite a lot of uses of lexers with explicit stacks), and I don’t know why.

Olle, OK, maybe a first step could be to make your lexer able to switch states and parse within and outside of the docblock ? Are you able to do that ?

Awesome, thanks! Will read. :smiley:

OK, my current (“working”, so far) solution is to just split it up in two different lexer/parser systems instead. Downside is I have to duplicate some token defs that are used in both lexers, but I guess I’ll live with that for now.

Instead of parsing the docblock to tokens in the main lexer, I do

  | "/**" _* as s "*/"            { DOCBLOCK_AS_STR s }

and then the docblock lexer/parser is called from the main parser, like

    | doc=DOCBLOCK_AS_STR "function" {
        let linebuf = Lexing.from_string doc in
        let cb = Docblockparser.docblock Docblocklexer.docblock linebuf in
        Function {
            docblock = cb;
            (* Code omitted *)
        }

Thanks for all help and feedback, learned a lot, maybe I can get it tighter at a later stage. :nerd_face: