I need to lex and parse a small DSL inside my language, namely the docblock:
/**
* @param array<int> $ints
*/
function foo(array &$ints): void {
}
The DSL is the @param part, and the main language is function foo etc.
In lexer.mll I have a main lexer, and a sub-lexer for the docblock (which always starts with /**). But I can’t figure out how to make the sub-lexer return its own list of tokens. It returns one token and then control goes back to the main lexer again.
| "/**" { docblock lexbuf }

and docblock = parse
| "@param" { DOCBLOCK_PARAM }
Source here: pholyglot/lexer.mll at main · olleharstedt/pholyglot · GitHub
Any tips how to achieve the recursive token stream here? Or link to relevant docs?
I took a quick look at your code, and … it wasn’t clear right away how your recursion works. So I’ll answer only more generally.
Lexer entrypoints are meant to only ever be invoked in tail-position, IIUC. When you call a lexer entrypoint in the action of another lexer pattern-branch, that’s “continuing the lex”. Eventually it’s all going to return a single token. If you want to accumulate some sort of list (and when I’ve done this, this is how I did it) you need to pass along an accumulating parameter to those recursive calls.
Alright, thanks for your answer. Do you have a link to source or docs on how to do that…?
Perhaps you mean that you want to “switch lexers” – so you’ll have one lexer when you’re outside the docblock, and a different one inside it? This is a common pattern, and when I do it, I use a state-variable that tracks which lexer to call. So you’ll have lexer1 and lexer2, each of which takes a parameter state. Then token will use that state to decide which of lexer1 and lexer2 to call.
Then you’ll have a function like:
let make_lexer () =
  let state = <expression to create state> in
  fun lb -> if state.choose_lex1 then lexer1 state lb else lexer2 state lb
and then
let token = make_lexer () in
.... parse ....
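Fleshing that sketch out a bit (all names here are made up for illustration; lexer1/lexer2 would normally be ocamllex entry points, but are plain functions so the snippet stands alone):

```ocaml
(* Self-contained sketch of the state-switching wrapper described above.
   [lexer1]/[lexer2] stand in for generated ocamllex entry points. *)
type token = MAIN_TOK | DOC_TOK

type state = { mutable in_docblock : bool }

(* Hypothetical stand-ins: the real entry points would pattern-match
   on the lexbuf and flip [state.in_docblock] on "/**" and "*/". *)
let lexer1 (_ : state) (_ : Lexing.lexbuf) = MAIN_TOK
let lexer2 (_ : state) (_ : Lexing.lexbuf) = DOC_TOK

let make_lexer () =
  let state = { in_docblock = false } in
  ( state,
    fun lb -> if state.in_docblock then lexer2 state lb else lexer1 state lb )
```

In a real lexer, an action in lexer1 would set state.in_docblock on seeing /**, and lexer2 would clear it on */.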
Maybe this isn’t relevant at all, just guessing.
Hm, I think I use the wrong terminology. I think I just want another “rule” to parse the docblock DSL. But if two lexers are easier, sure. Would still be helped by a link tho.
I’m digging around; I’m sure I can find something. One more question:
do you want to parse the docblock within the lexer, or do you want to return tokens from the docblock and have some other parser construct the docblock from those tokens ?
I was thinking of using one parser only. I just want different rules to apply to the tokens inside the docblock. Less things should be allowed than in the main lexer rule. So it’s a token stream, but with slightly different rules than the main one.
Hm, so no, no parsing inside the lexer.
Here’s a really trivial example: qc-ocaml/qasm2_lexer.mll at e4c2876beeb7658fb3305449a1c5236aa11e6265 · chetmurthy/qc-ocaml · GitHub
I want to have one lexer for the first line, and then after that, a different lexer.
I’ll look around and see if I can find others that are more involved.
That’s cool, thanks!
I tried just adding a “result” param to the docblock rule to collect all tokens. BUT, now the main parser expects a single token, not a list.
Error: This expression has type token list but an expression was expected of type token
You can make that work, too. How about this: imagine a language with comments and integers. That’s it. Then you have two tokens:
INT of int
COMM of string list
Then you can write a rule like (I’m writing this in the edit window, not actually compiling, so expect errors):
rule token = parse
| ['0'-'9']+ { INT (int_of_string (lexeme lexbuf)) }
| "/*" { comment [lexeme lexbuf] lexbuf }
| [' ' '\t' '\n']+ { token lexbuf }

and comment acc = parse
| "*/" { COMM (List.rev ((lexeme lexbuf) :: acc)) }
| [^ '*']+ { comment ((lexeme lexbuf) :: acc) lexbuf }
| '*' { comment ((lexeme lexbuf) :: acc) lexbuf }
I think that does it. Obviously, this is a really simple case where “parsing” the comment means just accumulating the chunks until you hit the end-comment string. If you were doing something more complicated (like OCaml’s “(* … *)”, where nested comments are allowed and the “(*” and “*)” must match, nesting-wise) you might want to let the parser do that. Or not – you could write your little bit of recursive-descent parser right there in the lexer. Jeremy Yallop’s Yojson “parser” is done that way (I think, for efficiency).
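For the nested case, the standard trick is to thread a depth counter alongside the accumulator. An ocamllex sketch, in the same not-actually-compiled spirit as the rule above:

```ocaml
and comment depth acc = parse
| "(*" { comment (depth + 1) ((lexeme lexbuf) :: acc) lexbuf }
| "*)" { if depth = 1 then COMM (List.rev acc)
         else comment (depth - 1) ((lexeme lexbuf) :: acc) lexbuf }
| [^ '(' '*']+ { comment depth ((lexeme lexbuf) :: acc) lexbuf }
| _ { comment depth ((lexeme lexbuf) :: acc) lexbuf }
```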
Interesting. I’m trying with %token <token list> DOCBLOCK right now. Thanks for all the quick and knowledgeable feedback!!
Edit: yes, it works now! I got this debug output
START_SCRIPT DOCBLOCK (DOCBLOCK_PARAM ARRAY_TYPE LT INT_TYPE GT NAME $ints)
FUNCTION NAME foo LPAREN ARRAY_TYPE AMPERSAND DOLLAR
NAME ints RPAREN COLON VOID_TYPE LBRACE RBRACE
for the snippet in OP.
OK, I actually have no idea how to parse `%token <token list> DOCBLOCK`, but I’ll look around.
I’m not sure, but I think it just means the generated token-type has a branch
| DOCBLOCK of token list
and nothing more.
Yeah, I was more thinking of how to design the grammar rules around it. Kinda stuck. Would be nice to be able to debug the grammar, too, somehow.
wait, what do you mean by “design the grammar rules around it” ? Do you need to parse that DOCBLOCK ? b/c if so, then you might want to switch to the style of two rules with a state-variable selecting which gets called.
Right yes, I need to parse it. Maybe I misunderstood you before.
I’ll give it a go!
I wrote here an example for two very simple nested grammars and lexers: GitHub - thierry-martinez/stacked_parsers: Example of stacked parsers for https://discuss.ocaml.org/t/recursion-in-menhir-lexer-for-small-dsl/10566
The two grammars are:
s ::= 0..9+ | '(' s ')' | '{' s2 '}'
s2 ::= [A..Z a..z]+ | '(' s ')' | '{' s2 '}'
Note that the two grammars share some lexemes but not all of them (the lexer of s recognizes integers, the lexer of s2 recognizes words over the alphabet a…z), and the two grammars can be arbitrarily nested.
I give two possible implementations: the first variant uses a stack of lexers, in the vein of what @Chet_Murthy describes. The second variant makes the lexer call the parser to parse nested grammars and stores the parsing result in a single token for the parent parser: in other words, the lexer uses the call stack instead of an explicit stack. I tend to think that this second variant is quite clean(er?) but I should admit I have never seen uses of it in the past (but I have seen quite a lot of uses of lexers with explicit stacks), and I don’t know why.
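To make the “use the call stack” idea concrete without any generated code, here is a hand-rolled recognizer for the two grammars above, in which the functions for s and s2 call each other directly at the nesting brackets (my own sketch, not taken from the linked repository):

```ocaml
(* Mutually recursive descent over the two grammars:
     s  ::= 0..9+       | '(' s ')' | '{' s2 '}'
     s2 ::= [A-Z a-z]+  | '(' s ')' | '{' s2 '}'
   Each function takes a position and returns the position after
   the phrase it consumed; nesting reuses the OCaml call stack. *)
exception Syntax_error of int

let recognize (input : string) : bool =
  let n = String.length input in
  let rec s i =
    if i < n && input.[i] = '(' then expect (s (i + 1)) ')'
    else if i < n && input.[i] = '{' then expect (s2 (i + 1)) '}'
    else span (fun c -> c >= '0' && c <= '9') i
  and s2 i =
    if i < n && input.[i] = '(' then expect (s (i + 1)) ')'
    else if i < n && input.[i] = '{' then expect (s2 (i + 1)) '}'
    else span (fun c -> (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) i
  and span p i =
    (* Consume one-or-more characters satisfying [p]. *)
    if i >= n || not (p input.[i]) then raise (Syntax_error i)
    else begin
      let j = ref i in
      while !j < n && p input.[!j] do incr j done;
      !j
    end
  and expect i c =
    if i < n && input.[i] = c then i + 1 else raise (Syntax_error i)
  in
  try s 0 = n with Syntax_error _ -> false
```

For example, recognize "({abc})" succeeds (s nests into s2 through the braces), while recognize "(abc)" fails, since inside parentheses we are back in s, which expects digits.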
Olle, OK, maybe a first step could be to make your lexer able to switch states and parse within and outside of the docblock ? Are you able to do that ?
Awesome, thanks! Will read.
OK, my current (“working”, so far) solution is to just split it up in two different lexer/parser systems instead. Downside is I have to duplicate some token defs that are used in both lexers, but I guess I’ll live with that for now.
Instead of parsing the docblock to tokens in the main lexer, I do
| "/**" (_* as s) "*/" { DOCBLOCK_AS_STR s }
and then the docblock lexer/parser is called from the main parser, like
| doc=DOCBLOCK_AS_STR "function" {
let linebuf = Lexing.from_string doc in
let cb = Docblockparser.docblock Docblocklexer.docblock linebuf in
Function {
docblock = cb;
(* Code omitted *)
}
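The load-bearing piece of this two-pass setup is Lexing.from_string, which builds a fresh lexbuf over just the captured docblock text. A minimal self-contained illustration, with a naive whitespace splitter standing in for the real Docblocklexer/Docblockparser pair:

```ocaml
(* Two-pass sketch: the main pass captures the raw docblock as a string,
   then a second pass lexes/parses only that substring.
   [scan_words] is a hypothetical stand-in for the generated sub-lexer/parser. *)
let scan_words (lb : Lexing.lexbuf) : string list =
  (* Read the buffered text straight out of the lexbuf and split it. *)
  let s = Bytes.sub_string lb.Lexing.lex_buffer 0 lb.Lexing.lex_buffer_len in
  String.split_on_char ' ' s |> List.filter (fun w -> w <> "")

let parse_docblock (doc : string) : string list =
  let linebuf = Lexing.from_string doc in
  scan_words linebuf
```

The design trade-off is the one discussed above: the duplicated token definitions are the price of keeping the two lexer/parser systems fully independent.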
Thanks for all help and feedback, learned a lot, maybe I can get it tighter at a later stage.