Hi there, I am writing a compiler for a simplified C language whose namespace rules differ subtly from C's. Here is a valid code snippet.
//test return 0
typedef int foo;
struct bar {
    foo foo;
};
typedef struct bar bar;
int main () {
    bar *p;
    p = alloc(struct bar);
    p = alloc(bar);
    return p->foo;
}
Notice that an identifier may be a variable name, a type name (introduced through typedef), or a field name (in a struct). A function name is treated as a variable name from the lexer's perspective. Here are the rules to follow:
- a type name cannot conflict with a variable name
- a field name in a struct may share its name with a type name
- field names within the same struct cannot collide with each other
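The three rules above can be sketched as a small checker. This is only an illustration under stated assumptions: plain strings and the standard library's `Set` stand in for the compiler's own symbol type, the names `add_type`, `add_var` and `check_fields` are hypothetical, and neither scoping nor struct tags are modelled.

```ocaml
(* Sketch of the three namespace rules. String sets stand in for the
   compiler's own symbol sets; every name here is hypothetical. *)
module S = Set.Make (String)

type env = { types : S.t; vars : S.t }

let empty = { types = S.empty; vars = S.empty }

(* Rule 1: a type name and a variable name may not coincide. *)
let add_type name env =
  if S.mem name env.vars then Error ("type name clashes with variable: " ^ name)
  else Ok { env with types = S.add name env.types }

let add_var name env =
  if S.mem name env.types then Error ("variable name clashes with type: " ^ name)
  else Ok { env with vars = S.add name env.vars }

(* Rules 2 and 3: fields live in a per-struct namespace, so they are checked
   only against each other and never against the type names. *)
let check_fields fields =
  let rec go seen = function
    | [] -> Ok ()
    | f :: rest ->
      if S.mem f seen then Error ("duplicate field: " ^ f)
      else go (S.add f seen) rest
  in
  go S.empty fields

let () =
  match add_type "foo" empty with
  | Error msg -> print_endline msg
  | Ok env ->
    (* "foo" is a type, so a field named foo is fine but a variable is not. *)
    assert (check_fields [ "foo" ] = Ok ());
    assert (Result.is_error (add_var "foo" env));
    assert (Result.is_error (check_fields [ "x"; "x" ]))
```

Keeping the field check independent of the type set is what lets `foo foo;` inside `struct bar` pass while a variable named `foo` would be rejected.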
It seems inevitable to use a global variable to help the lexer know which kind of identifier (VIdent, TIdent or FIdent) to return. I am using a set to record all custom types.
# env.ml
let env = ref Symbol.Set.empty
# parser.mly
gdecl :
  | blahblah
  | Typedef; t = dtype; var = midrule(var = VIdent {Env.add var; var}); Semicolon
    { Cst.Typedef {t = t; t_var = var} }
  | Struct; var = VIdent; L_brace; fields = field_list; R_brace; Semicolon
    { Cst.Sdefn { struct_name = var; fields = fields; } }
dtype :
  | blahblah
  | ident = TIdent;
    { `Ctype ident }
  | Struct; var = VIdent;
    { `Struct var }
# lexer.mll
rule initial = parse
  | blahblah
  | ident as name { let var = (Symbol.symbol name) in
                    if Env.mem var then Parser.TIdent var else Parser.VIdent var }
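The snippets above call `Env.add` and `Env.mem`, which are not shown. A minimal sketch of what such a module might look like follows. It assumes plain strings in place of `Symbol.t` and the standard library's `Set` in place of `Symbol.Set` (whose real interface may differ), and the `classify` helper is a hypothetical stand-in for the lexer action.

```ocaml
(* Hypothetical stand-in for env.ml: a global, mutable set of typedef names. *)
module Env = struct
  module S = Set.Make (String)

  let env = ref S.empty

  (* Record a new typedef name. *)
  let add name = env := S.add name !env

  (* Has this name been introduced by a typedef? *)
  let mem name = S.mem name !env
end

type token = TIdent of string | VIdent of string

(* Mirrors the decision made in the lexer rule for identifiers. *)
let classify name = if Env.mem name then TIdent name else VIdent name

let () =
  (* Before the typedef is recorded, foo is an ordinary identifier. *)
  assert (classify "foo" = VIdent "foo");
  Env.add "foo";
  (* Afterwards the same spelling is classified as a type name. *)
  assert (classify "foo" = TIdent "foo")
```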
It works fine if there are only type and variable names. However, once structs are introduced, the lexer needs more information to determine which kind of identifier to return. Since Menhir generates an LR(1) parser, the lookahead token may already have been lexed by the time a semantic action runs, so I cannot just set a variable right before an identifier is lexed, as in the following (notice that I set ret_tident and ret_fident before reading the identifier):
# parser.mly
dtype :
  | blahblah
  | midrule({Env.ret_tident := true}); ident = TIdent;
    { Env.ret_tident := false;
      `Ctype ident }
  | Struct; var = VIdent;
    { `Struct var }
gdecl :
  | blahblah
  | Typedef; t = dtype; var = midrule(var = VIdent {Env.add var; var}); Semicolon
    { Cst.Typedef {t = t; t_var = var} }
  | Struct; var = VIdent; L_brace; fields = field_list; R_brace; Semicolon
    { Cst.Sdefn { struct_name = var; fields = fields; } }
field :
  | t = dtype; midrule({Env.ret_fident := true}); var = FIdent; Semicolon
    { Env.ret_fident := false;
      {t = t; i = var} : Cst.field }
field_list :
  |
    { [] }
  | field = field; fields = field_list
    { field :: fields }
# lexer.mll
rule initial = parse
  | blahblah
  | ident as name {
      let var = (Symbol.symbol name) in
      if !Env.ret_tident
      then Parser.TIdent var
      else
        if !Env.ret_fident
        then Parser.FIdent var
        else Parser.VIdent var }
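To illustrate why the flags above can arrive too late, here is a toy model of the timing problem. It is not Menhir itself: the hand-written `parse_field` driver, the `Int_kw` token and the string-based `lex` function are all hypothetical, and a real LR(1) automaton may reduce without requesting a lookahead in some states, so the exact timing depends on the grammar.

```ocaml
(* Toy model: a lexer that consults a mutable flag, driven by a parser that
   fetches one token of lookahead before running its semantic action. *)
type token = VIdent of string | FIdent of string | Int_kw | Semicolon | Eof

let ret_fident = ref false

(* The "lexer": classifies a word using the flag's value at call time. *)
let lex word =
  match word with
  | "int" -> Int_kw
  | ";" -> Semicolon
  | name -> if !ret_fident then FIdent name else VIdent name

(* The "parser" for a field such as: int foo; *)
let parse_field words =
  let input = ref words in
  let next () =
    match !input with
    | [] -> Eof
    | w :: rest -> input := rest; lex w
  in
  let ty = next () in          (* shift the type keyword *)
  let lookahead = next () in   (* lookahead is lexed before any reduction *)
  ret_fident := true;          (* the mid-rule action only runs now *)
  let result = (ty, lookahead) in
  ret_fident := false;
  result

let () =
  match parse_field [ "int"; "foo"; ";" ] with
  | Int_kw, VIdent name -> Printf.printf "%s was lexed as VIdent\n" name
  | Int_kw, FIdent name -> Printf.printf "%s was lexed as FIdent\n" name
  | _ -> print_endline "unexpected tokens"
```

In this model the field name is classified while `ret_fident` is still false, so it comes back as `VIdent` even though the action that sets the flag sits textually before it in the rule.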
So, what is a good practice for implementing a complex context-dependent lexer?
I know it is quite clumsy; thank you for reading. Any ideas would be appreciated!