Good practice for interaction between mly and mll files

,

Hi there, I am writing a compiler for a simplified C language, its namespace have subtle difference though. Here is a valid code snippet.

//test return 0
typedef int foo;

struct bar {
    foo foo;
};

typedef struct bar bar;

int main () {
    bar *p;
    p = alloc(struct bar);
    p = alloc(bar);
    return p->foo;
}

Notice that identifier may be variable name, type name(through typedef) and field name(in struct). Function name is treated as variable name from lexer’s perspective. Here are several rules to follow

  1. type name cannot conflict with variable name
  2. field name in a struct may share the same name as a type name
  3. field name cannot collide with each other in the same struct

It seems inevitable to use a global variable to help lexer know which kind of identifier(VIdent, TIdent or FIdent) to return. I am using a set to record all custom types.

# env.ml
let env = ref Symbol.Set.empty
# parser.mly
gdecl :
  | blahblah
  | Typedef; t = dtype; var = midrule(var = VIdent {Env.add var; var}); Semicolon
      { Cst.Typedef {t = t; t_var = var} }
  | Struct; var = VIdent;  L_brace; fields = field_list; R_brace; Semicolon
      { Cst.Sdefn { struct_name = var; fields = fields; } }
dtype : 
  | blahblah
  | ident = TIdent;
      { `Ctype ident }
  | Struct; var = VIdent;
      { `Struct var }
# lexer.mll
rule initial = parse
  |  blahblah
  |  ident as name { let var = (Symbol.symbol name) in 
                    if Env.mem var then Parser.TIdent var else Parser.VIdent var }

It works fine if there are only type and variable name. However, once struct is introduced, lexer needs more information to determine which kind of identifier to return. Since Menhir is applying LR(1), so I cannot just set a variable right before a identifier is lexed, something like(notice that I set ret_tident and ret_fident before read identifier)

# parser.mly
dtype : 
  | blahblah
  | midrule({Env.ret_tident := true}); ident = TIdent;
      { Env.ret_tident := false;
       `Ctype ident }
  | Struct; var = VIdent;
      { `Struct var }
gdecl :
  | blahblah
  | Typedef; t = dtype; var = midrule(var = VIdent {Env.add var; var}); Semicolon
      { Cst.Typedef {t = t; t_var = var} }
  | Struct; var = VIdent; L_brace; fields = field_list; R_brace; Semicolon
      { Cst.Sdefn { struct_name = var; fields = fields; } }

field : 
  | t = dtype; midrule({Env.ret_fident := true}); var = FIdent; Semicolon
      { Env.ret_fident := false;
        {t = t; i = var} : Cst.field }

field_list :
  | 
      { [] }
  |  field = field; fields = field_list
      { field :: fields }
# lexer.mll
rule initial = parse
  |  blahblah
  |  ident as name { 
      let var = (Symbol.symbol name) in 
      if !Env.ret_tident 
      then Parser.TIdent var 
      else 
        if !Env.ret_fident 
        then Parser.FIdent var 
        else Parser.VIdent var }

So, what is a good practice to achieve a complex context dependent lexer?

I know it is quite clumsy, thank you for read it. Any ideas would be grateful!

Um, I’m sure there’s a way to make your current technique work, but why not just have one type of identifier, parse into an AST that possibly isn’t correct, and then do the checking alongside type-checking? I mean, you have to accumulate environments for structs, types, fields, variables anyway, so you can do all that checking, and just transform your AST#1 into an AST#2 that is guaranteed to be correct by the rules you enumerated.

Thank you for your reply!

That’s a good point, it indeed reduces the difficulty and I agree that handling them in the following steps(AST, typecheck, etc.) would be easier. However, this would lead to reduce/reduce conflict during parsing. So I even cannot get my AST#1. Consider following snippet

foo * bar;

If we only have one type of identifier, there are two ways to reduce it.

  1. We may treat foo as a variable, which means this is an multiplication. Treating an expression as a statement is valid in this grammar, so it is this reduce is possible.
  2. foo may be a custom type, in this case, we are declaring a pointer of type foo.

That’s why I introduce TIdent and VIdent(maybe FIdent in struct). If there is a way to solve this reduce/reduce conflict, I am very happy to stick to it :grinning:

Oh ha, I forgot about C’s horrific syntax for type declarations.

Let’s see … it seems like:

  1. you have type-ids and var-ids.
  2. field-ids can be either.
  3. so you could have a table in Env that holds all the type-ids
  4. and then, in the lexer, if you see an indentifier in that set, it’s a TIdent; otherwise, it’s a VIdent
  5. then in the parser you don’t need to tell the lexer whether you’re expecting a TIdent or not, it would seem ?
  6. and the only trick would be that for a field, the field-name can be either a TIdent or a VIdent ?

I assume you don’t have namespaces (like in C++) so once a type-name “foo” is defined, no subsequent variable can be named “foo”.

Using both quote and number has ugly indent, so I use roman number to quote.

I. you have type-ids and var-ids.

Well, if type-ids means the identifier name for types, then yes. They are stored in Env. However, we do not store variable names(if this is what you call var-ids). Identifiers not stored in Env are treated as var-ids.

II. field-ids can be either.

Yes, field-ids can be either type-ids or var-ids. However, field-ids cannot collide in the same struct. For example, the first snippet is valid, but the second is not.

typedef foo int;
struct s{
    foo foo; # fine
    foo bar; # fine, same type can appear twice
};
typedef foo int;
struct s{
    foo foo;  # fine
    foo foo;  # error. field appears twice.
};

III. so you could have a table in Env that holds all the type-ids

That’s right. It can hold other variables if necessary.

IV. and then, in the lexer, if you see an indentifier in that set, it’s a TIdent; otherwise, it’s a VIdent

This is correct until I introduce struct. Lexer cannot tell whether it is a type-id or a field-id.

V. then in the parser you don’t need to tell the lexer whether you’re expecting a TIdent or not, it would seem ?

In fact, I tried to tell lexer what it is expected to return, something like

dtype:
  | blahblah
  | midrule({Env.ret_tident:=true}); ident = TIdent;
      { `Ctype ident }

However, this ident is read before the midrule works(remember the LR(1)). So the lexer may return a var-id and lead to a parse error.

VI. and the only trick would be that for a field, the field-name can be either a TIdent or a VIdent ?

I feel like the field rule is slightly different from VIdent. For example, it can share the same name as TIdent where VIdent cannot.

I assume you don’t have namespaces (like in C++) so once a type-name “foo” is defined, no subsequent variable can be named “foo”.

Yes, all the typedef is in the same level(no namespace). A type-name can only be declared once.

Is that so? It feels like you only need TIdent or VIdent; FIdent does not matter. In particular, when you are in a position where your parser expects a field name (i.e., after . or ->), just accept any of TIdent or VIdent.

2 Likes

You can check that all fields of a struct have different names, in the action of the struct – no need to involve the lexer, yes?

And in the action for a typedef, you can add the type-name to the Env table for types.

That should allow you to not have to do anything special in the grammar, other than having two rules for fields and field-derefs:

field-decl: typeexp (TIdent | VIdent)

and

exp: exp MINUSGREATER (TIdent | VIdent)

or at least, it seems like that’s how it would work.

1 Like

Hmmm… are you saying I can write two productions with same action to accept both TIdent or VIdent where it expects a field name?

Yes, that’s what I mean.

You can check that all fields of a struct have different names, in the action of the struct – no need to involve the lexer, yes?

This is a great idea! So I can treat FIdent just the same as VIdent, and then do the check afterwards. I will update here if there are any issues!

And you might as well name the lexeme “Ident”, so you’ll have “TIdent” and “Ident”.

1 Like

I don’t have an answer but you could study other C parsers written in OCaml for inspiration:

1 Like

I’m involved in maintaining a fork of CIL so I can quickly elaborate on its approach. CIL just uses the classic lexer hack: https://github.com/goblint/cil/blob/f330065c7c496109efd33cb005930f68abff4cf4/src/frontc/lexerhack.ml. In this case with a helper module that facilitates the reverse communication between the lexer and the parser.
AFAIK CompCert’s C parser is also based on FrontC and uses the same technique.

2 Likes

Thank you for your insight! Still, I have a few questions

  1. It seems the lexer uses a table named lexicon to store all sorts of tokens. However, I only found definition of add_type in clexer.mll, and a reference assignment in init and initFromString function . I haven’t found where is add_type called. So how is a custom type added to the lexicon hash table?

  2. There are two functions: scan_ident and add_identifier, does this mean that lexer runs twice, first add the identifier to the lexicon and then return ident through scan_ident? For your convenience, I copied the snippet for these functions

let add_identifier name =
  match !context with
    [] -> raise (InternalError "Empty context stack")
  | con::sub ->
      (context := (name::con)::sub;
       (*                print_string ("adding IDENT for " ^ name ^ "\n"); *)
       H.add lexicon name (fun loc ->
         dbgToken (IDENT (name, loc))))

let scan_ident id =
  let here = currentLoc () in
  try (H.find lexicon id) here
  (* default to variable name, as opposed to type *)
  with Not_found ->
    if id.[0] = '$' then QUALIFIER(id,here) else
    dbgToken (IDENT (id, here))
  1. How does this parser distinguish type name and variable name when parsing an identifier? There is two functions add_type and add_identifier, but I cannot see in condition they are called.

Again, thank you for your reference, it is really inspiring!

The general idea for having a table in the lexer that is fed by the parser is as follows: the first time that an identifier is defined or declared, the parser calls the add_whatever() function to add the identifier to the table With whatever kind or type the identifier is. Then the next time the identifier is seen by the lexer, it uses that table to determine what kind of token it will return for the identifier – type ID, variable ID, field ID, etc.

So in your original code, in the rule for parsing “typedef ident” you call “add_type()” to register the identifier as a type-id. Then later when parsing “T * x;” the lexer looks up “T”, sees that it’s a type-id, and returns that token, instead of a generic identifier.

1 Like

Thank you for your elaboration. I try to call add_type() once parsing typedef, however, I found the lookahead is quite annoying. Consider this case

typedef int fpt;
fpt fadd(fpt x, fpt y);

And my parser looks like

gdecl :
  | blahblah
  | Typedef; t = dtype; var = VIdent; Semicolon
    {Env.add_type(t); Cst.Typedef{t=t; t_var=var}}
  | ret_type = dtype; func_name = VIdent; pars = param_list; blk = block;
    {do_something}

When the parser finished reading semicolon of typedef int fpt; the next return type is already lexed due to look ahead. And since this add_type is not finished yet, the fpt will be regarded as a VIdent.

I solved this problem by adding a midrule before Semicolon is parsed, but how is this achieved through a lookup table?

I’m not sure what you mean by your question, but … your solution is the typical one (certainly it’s what I do in such situations). Not sure what you mean by a “lookup table”.

1 Like

I checked the CIL again, and found they are not using midrule(that’s what I was curious about), but using id_or_typename, which is what you suggested before! I think I will stick to this approach and try if it works.

id_or_typename:
    IDENT				{fst $1}
|   NAMED_TYPE				{fst $1}
|   AT_NAME LPAREN IDENT RPAREN         { "@name(" ^ fst $3 ^ ")" }     /* pattern variable name */
;