Implement context dependent language with ocamlyacc

Hi there,

I am implementing a context-dependent language, and this language supports custom typedef, pointer and multiplication. So there are two possible parser trees for the expression below

foo * bar;
  1. variable foo multiplies variable bar
  2. declare a pointer bar of type foo

My grammar cannot handle this situation yet:

simp :
  | t = dtype; ident = Ident;
      { Cst.New_var { t = t; name = ident } }
  | e = exp;
      { Cst.Sexp e }
  | other rules
exp :
  | lhs = exp;
    Star;
    rhs = exp;
  | other rules

My plan is to add a mutable set in the header. This set will store custom type during parsing. If foo is a declared type, then it will parse it as a declare statement. Otherwise, it will treat it as a multiplication. However, I am not sure how to use this mutable set to determine which rule to apply :frowning: I am not even sure if this approach is correct…

Any ideas would be grateful!

You could have a first exp type which would be something like type exp = ... | Star of string * string. Then, you would write an OCaml function exp -> exp' with type exp' = ... | Mul of string * string | Ptr of string * string. This would avoid having to do too complex things in the parser.

Are you designing the language ? If yes, the better solution would probably be to change the syntax to avoid the problem.

Is the multiplication only going to happen on the right-hand side of a assignment: int x = y * z; ? If so, I believe you can fix the problem directly in the parser without introducing a mutable set or whatever.

1 Like

Thank you for your reply!

Are you designing the language?

In fact, I am self-studying a compiler course, and this is its assignment. It is basically a simplified version of C. I copied the syntax below

type exp = ... | Star of string * string seems to treat declaration as an expression, right? In fact, this approach is mentioned in the handout, but it states

One way to handle this is to perform an ambiguous parse: use one rule to parse both the
declaration form and the expression form. Then undo an incorrect decision during elaboration. This approach will almost certainly involve some other adjustment to various pieces of your grammar and lexer. Looking over grammars and parsers from past years, we cannot honestly recommend this technique—results seem to be largely contorted with low confidence in correctness.

Is the multiplication only going to happen on the right-hand side of a assignment: int x = y * z; ?

Sadly, an expression can be a statement just for potential side-effect(like 0/0) in this language. So y * z; is just legal.

I suggest to go with the solution described in your assignment: “the parser can update mutable state to record type identifiers as such. With this approach, the lexer can produce different tokens for type and
non-type identifiers.” Making the lexer a bit smarter is much easier than making the parser context-dependent.

1 Like

Thank you for your suggestion! Modifying the lexer is indeed simpler. I am working on it and see if it goes well!

I am sorry the title is misleading, it should be Implement context dependent language with menhir. I didn’t distinguish ocamlyacc and menhir before…

I read the menhir document and a related post that makes lexer a bit smarter. Now i got some clues but still need some helps.

I am planning to return different tokens, say variable token and type token, according the same rule. snippet of .mll file looks like

let ident = ['A'-'Z' 'a'-'z' '_']['A'-'Z' 'a'-'z' '0'-'9' '_']*
rule initial type_set = parse
  | ident as name {if ident in type_set then TIdent name else VIdent name}

This type_set stores all type variables declared till now. So once a identifier is recognized, lexer can check if this identifier is a type variable or not. If it is a type variable, then return type ident token for parser, otherwise variable ident token.

However, the key point now is how can I pass this type_set to lexer. I know ocamllex allows to pass argument at entrypoint. However, these arguments seems to be internal. Can I pass these argument from parser(menhir)?

The simplest solution is to have a global variable shared by both the lexer and the parser. This variable would contain a Hashtbl or a reference to a Set. The parser would update it and the lexer would read it.

I am not sure what you mean by internal. Any extra argument in an ocamllex rule is part of the signature of the generated function. In particular, when you pass the lexbuf argument to the main lexer function during the initialization of the parser, you also pass all the extra arguments.

If you do not want the above Hashtbl to be a global variable (e.g., because you want your lexer/parser to be reentrant), then you can create a local one during parser initialization and pass it to the lexer entrypoint.

1 Like

Thank you for elaboration!

I found the menhir manual page 30 indicate the signature for main parser function is

val main: (Lexing.lexbuf -> token) -> Lexing.lexbuf -> thing

It seems that it only accepts lexer with signature Lexing.lexbuf -> token. So I said the lexer exposed to parser cannot have extra arguments.

I think a global variable shared between lexer and parser is great. I rarely use global variable across files, and it makes sense!

I am also quite interested in how to use local variable during parser initializations and pass it to the lexer entrypoint. In fact, even though it is successfully passed to lexer entrypoint, how can I update this variable in parser semantic actions(when parser detect a typedef variable)?

Nobody forces you to pass the lexer function to the parser function in its original form. You can do arbitrary manipulations on it beforehand. In particular, you can pass all the missing arguments, so that both signatures match.

Eventually, both the lexer and the parser are transformed into plain OCaml files. If you define a global variable in a third OCaml file and both the lexer and the parser depends on this third file, then the variable is effectively shared.

From the point of view of the parser, it is still a global variable, and it is modified as such. It is just that you do not need this global variable to be put into a separate file, since it is passed as an argument to the lexer function rather than being globally visible.

But note that if you have declared your parser using the %parameter directive of Menhir, this global variable is actually a module-local one. So, you can have several separate instances of the parser, each with their own global variable. (I do not suggest to go that way, it is presumably not worth the effort.)

1 Like