Recursion in Menhir lexer for small DSL

olleharstedt · October 1, 2022, 8:02pm

I need to lex and parse a small DSL inside my language, related to the docblock:

    /**
     * @param array<int> $ints
     */
    function foo(array &$ints): void {
    }

The DSL is the @param part, and the main language is function foo etc.

In lexer.mll I have a main lexer, and a sub-lexer for the docblock (always starting with /**). But I can’t figure out how the sub-lexer returns its own list of tokens. It returns one and then goes back to the main lexer again.

  | "/**"                         { docblock lexbuf }

and

and docblock = parse
  | "@param"            { DOCBLOCK_PARAM }

Source here: pholyglot/lexer.mll at main · olleharstedt/pholyglot · GitHub

Any tips how to achieve the recursive token stream here? Or link to relevant docs?

Chet_Murthy · October 1, 2022, 8:08pm

I took a quick look at your code, and … it wasn’t clear right away how your recursion works. So I’ll answer only more generally.

Lexer entrypoints are meant to only ever be invoked in tail-position, IIUC. When you call a lexer entrypoint in the action of another lexer pattern-branch, that’s “continuing the lex”. Eventually it’s all going to return a single token. If you want to accumulate some sort of list (and when I’ve done this, this is how I did it) you need to pass along an accumulating parameter to those recursive calls.

olleharstedt · October 1, 2022, 8:10pm

Alright, thanks for you answer. You have a link to src or doc on how to do that…?

Chet_Murthy · October 1, 2022, 8:12pm

Perhaps you mean that you want to “switch lexers” – so you’ll have one lexer when you’re outside the docblock, and a different one inside it? This is a common pattern, and when I do it, I use a state-variable that tracks which lexer to call. So you’ll have lexer1 and lexer2, each of which take a parameter state. Then token will use that state to decide which of lexer1 and lexer2 to call.

Then you’ll have a function like:

let make_lexer () = let state = <expression to create state> in
fun lb -> if state.choose_lex1 then lexer1 state lb else lexer2 state lb

and then

let token = make_lexer () in
.... parse ....

Maybe this isn’t relevant at all, just guessing.

olleharstedt · October 1, 2022, 8:14pm

Hm, I think I use the wrong terminology. I think I just want another “rule” to parse the docblock DSL. But if two lexers are easier, sure. Would still be helped by a link tho.

Chet_Murthy · October 1, 2022, 8:15pm

I’m digging around; I’m sure I can find something. One more question:

do you want to parse the docblock within the lexer, or do you want to return tokens from the docblock and have some other parser construct the docblock from those tokens ?

olleharstedt · October 1, 2022, 8:18pm

I was thinking of using one parser only. I just want different rules to apply to the tokens inside the docblock. Less things should be allowed than in the main lexer rule. So it’s a token stream, but with slightly different rules than the main one.

Hm, so no, no parsing inside the lexer.

Chet_Murthy · October 1, 2022, 8:30pm

Here’s a really trivial example: qc-ocaml/qasm2_lexer.mll at e4c2876beeb7658fb3305449a1c5236aa11e6265 · chetmurthy/qc-ocaml · GitHub

I want to have one lexer for the first line, and then after that, a different lexer.

I’ll look around and see if I can find others that are more involved.

olleharstedt · October 1, 2022, 8:36pm

That’s cool, thanks!

I tried just adding a “result” param to the docblock rule to collect all tokens. BUT, now the main parser expects a single token, not a list.

Error: This expression has type token list but an expression was expected of type token

github.com

olleharstedt/pholyglot/blob/main/src/lib/lexer.mll#L102


      
            | "int"                         { INT_TYPE }
            | "float"                       { FLOAT_TYPE }
            | "void"                        { VOID_TYPE }
            | "string"                      { STRING_TYPE }
            | "array"                       { ARRAY_TYPE }
            | "&"                           { AMPERSAND }
            | "foreach"                     { FOREACH }
            | "as"                          { AS }
            | "do"                          { DO }
            | "while"                       { WHILE }
            | ['"'] [^ '"']+ ['"'] as s     { STRING_LITERAL s }
            | class_name as n               { CLASS_NAME n}
            | var_identifier as id          { VAR_NAME (String.sub id 1 (String.length id - 1)) }
            | identifier as id              { NAME id }
            | eof                           { EOF }
            | _ { raise (Error (Printf.sprintf "At offset %d: unexpected character.\n" (Lexing.lexeme_start lexbuf))) }
          
          and initial_linebegin = parse
            | '\n'                          { new_line lexbuf; initial_linebegin lexbuf }
            | whitespace_char_no_newline    { initial_linebegin lexbuf }
            | ""                            { token lexbuf }

Chet_Murthy · October 1, 2022, 8:45pm

You can make that work, too. How about this: imagine a language with comments and integers. That’s it. Then you have two tokens:

INT of int
COMM of string list

Then you can write a rule like (I’m writing this in the edit window, not actually compiling, so expect errors):

rule token = parse
[0-9]+ { INT (int_of_string (lexeme lexbuf)) }
| "/*" { comment [lexeme lexbuf)] lexbuf }
| [' ' '\t' '\n']+ { token lexbuf }

and comment acc = parse
"*/" { COMM (List.rev ((lexeme lexbuf)::acc))) }
| [^ '*']+ { comment ((lexeme lexbuf)::acc) lexbuf }

I think that does it. Obviously, this is a really simple case where “parsing” the comment means just accumulating the blocks until you get an end-comment string. If you were doing something more complicated (like OCaml’s “(* … )" where the nested comments are allowed, and the "(” and “*)” must match, nesting-wise) you might want to let the parser do that. Or not – you could write your little bit of recursive-descent parser right there in the lexer. Jeremy Yallop’s Yojson “parser” is done that way (I think, for efficiency).

olleharstedt · October 1, 2022, 8:49pm

Interesting. I’m trying with %token <token list> DOCBLOCK right now. Thanks for all the quick and knowledgeable feedback!!

Edit, yes, works now! Got debug output

START_SCRIPT DOCBLOCK (DOCBLOCK_PARAM ARRAY_TYPE LT INT_TYPE GT NAME $ints) 
FUNCTION NAME foo LPAREN ARRAY_TYPE AMPERSAND DOLLAR 
NAME ints RPAREN COLON VOID_TYPE LBRACE RBRACE

for the snippet in OP.

olleharstedt · October 1, 2022, 10:07pm

OK, I have actually no idea how to parse

`%token <token list> DOCBLOCK`

but I’ll look around.

Chet_Murthy · October 1, 2022, 11:37pm

I’m not sure, but I think it just means the generated token-type has a branch

| DOCBLOCK of token list

and nothing more.

olleharstedt · October 1, 2022, 11:40pm

Yeah, I was more thinking of how to design the grammar rules around it. Kinda stuck. Would be nice to be able to debug the grammar, too, somehow.

Chet_Murthy · October 2, 2022, 12:01am

wait, what do you mean by “design the grammar rules around it” ? Do you need to parse that DOCBLOCK ? b/c if so, then you might want to switch to the style of two rules with a state-variable selecting which gets called.

olleharstedt · October 2, 2022, 9:34am

Right yes, I need to parse it. Maybe I misunderstood you before.

I’ll give it a go!

thierry-martinez · October 2, 2022, 7:51pm

I wrote here an example for two very simple nested grammars and lexers: GitHub - thierry-martinez/stacked_parsers: Example of stacked parsers for https://discuss.ocaml.org/t/recursion-in-menhir-lexer-for-small-dsl/10566

The two grammars are:

s ::= 0..9+ | '(' s ')' | '{' s2 '}'
s2 ::= [A..Z a..z]+ | '(' s ')' | '{' s2 '}'

Note that the two grammars share some lexems but not all of them (the lexer of s recognizes integers, the lexer of s2 recognizes words on the alphabet a…z), and the two grammars can be arbitrary nested.

I give two possible implementations: the first variant uses a stack of lexers, in the vein of what @Chet_Murthy describes. The second variant makes the lexer call the parser to parse nested grammars and stores the parsing result in a single token for the parent parser: in other words, the lexer uses the call stack instead of an explicit stack. I tend to think that this second variant is quite clean(er?) but I should admit I have never seen uses of it in the past (but I have seen quite a lot of uses of lexers with explicit stacks), and I don’t know why.

Chet_Murthy · October 2, 2022, 7:53pm

Olle, OK, maybe a first step could be to make your lexer able to switch states and parse within and outside of the docblock ? Are you able to do that ?

olleharstedt · October 2, 2022, 7:58pm

Awesome, thanks! Will read.

olleharstedt · October 2, 2022, 9:38pm

OK, my current (“working”, so far) solution is to just split it up in two different lexer/parser systems instead. Downside is I have to duplicate some token defs that are used in both lexers, but I guess I’ll live with that for now.

Instead of parsing the docblock to tokens in the main lexer, I do

  | "/**" _* as s "*/"            { DOCBLOCK_AS_STR s }

and then the docblock lexer/parser is called from the main parser, like

    | doc=DOCBLOCK_AS_STR "function" {
        let linebuf = Lexing.from_string doc in
        let cb = Docblockparser.docblock Docblocklexer.docblock linebuf in
        Function {
            docblock = cb;
            (* Code omitted *)
        }

Thanks for all help and feedback, learned a lot, maybe I can get it tighter at a later stage.

Topic		Replies	Views
How to parse a token with a list argument in Menhir? Learning menhir	7	1662	October 2, 2022
Call Menhir-generated parser without a lexbuf Learning menhir	8	642	October 2, 2022
Ocamllex and menhir examples that aren't calculators Learning menhir , ocamllex , ocamlyacc	10	3812	November 20, 2021
Define Literals on Parser using ocamlyacc / menhir Learning	7	605	September 29, 2021
Lookahead in menhir parser Learning	8	2268	February 13, 2019

Recursion in Menhir lexer for small DSL

Related topics