Lexer rule for C# attribute/annotation

Hi,
I want to create a lexer rule to capture a C# attribute.
Examples of C# attributes are:

[Serializable]
public class SampleClass
{
// Objects of this type can be serialized.
}

[System.Runtime.InteropServices.DllImport("user32.dll")]
extern static void SampleMethod();

[Conditional("DEBUG"), Conditional("TEST1")]
void TraceMethod()
{
// …
}

In the above examples, the C# attributes are the parts in square brackets. So, basically, I want to capture everything between “[” and “]”, inclusive.

I have written a rule as below:

(* Some declarations *)
let text_nondigit = ['A'-'Z' 'a'-'z' '_' '@' '$']
let digit = ['0'-'9']
let text = text_nondigit (text_nondigit | digit)*

| '[' (text as t)
   {
     (* [text] is only a regexp name; bind the matched string to [t] to use it. *)
     let buf = Buffer.create 32 in
     let s = Stack.create () in
     Stack.push '[' s;
     Buffer.add_string buf ("[" ^ t);
     parse_csharp_annotation s buf lexbuf;
     Buffer.contents buf
   }

And parse_csharp_annotation function is as below:

and parse_csharp_annotation s buf = parse
| ']'         { ignore (Stack.pop s);
                Buffer.add_string buf "]";
                (* Stop once the bracket that opened the attribute is closed. *)
                if not (Stack.is_empty s) then parse_csharp_annotation s buf lexbuf }
| '['         { Stack.push '[' s;
                Buffer.add_string buf "[";
                parse_csharp_annotation s buf lexbuf }
| eol as str  { Buffer.add_string buf str;
                update_line_number lexbuf;
                parse_csharp_annotation s buf lexbuf }
| eof         { () }  (* unterminated attribute: return what was read so far *)
| _           { Buffer.add_string buf (Lexing.lexeme lexbuf);
                parse_csharp_annotation s buf lexbuf }

Now, this works for me to some extent. But I also want to exclude array index syntax from this capture rule.
For example, an array index can have the same syntax as a C# attribute:
arrayName [indexVariable]

Do you have any suggestions for this problem?

Thank you in advance.

If it’s only for that tiny fragment of C# syntax, I’d just use a parser combinator, which seems better suited for (logic heavy) pattern matching over data.

I have a half constructed example, but I realised it would not work if the index string also contains brackets, e.g. arr1[arr2[x]].

Do the attributes always start at the first column? Is it necessarily the case that strings inside the attributes do not contain square brackets?

Thank you Darren for the reply.

Users can use anything inside the string including square brackets, we never know.
I do not understand what you mean by:

Do the attributes always start at the first column?

What do you mean by first column here?

Right, sorry. I mean: must the attribute start at the first character of a line, or can you attach it anywhere as long as it’s before a class?

EDIT: if it’s the latter case, I can’t think of a robust/general way of extracting the attribute section without at least having a partial abstract syntax tree of C#.

On its line, a C# attribute can only be preceded by whitespace.
I hope this is what you meant in your question.

You can see more examples here

Okay, I see. You’ll want an AST construction to do this robustly, since attributes can occur anywhere (if they could only be used at top level, it’d be a much easier case to handle).

I can try to whip up a small bit of code tomorrow if that’s helpful.

EDIT: I noticed that attributes seem to all be PascalCase, is this part of the syntax requirement? Also can variables use PascalCase?

If attributes must be PascalCase, and variables can only be camelCase or underscore separated, then that could give us an easy solution, though might be fragile.

EDIT2: Okay nvm, I saw some code snippets that use PascalCase for variable names.

There are some examples with lower-case prefix, such as in [module: CLSCompliant(true)].

I think that we can recognize that an opening square bracket [ is an array subscript by checking that it follows an identifier, or the closing parenthesis of an expression, or the closing square bracket of a previous array subscript (for instance, a[1][2] contains array subscripts but void MethodA([In][Out] ref double x) { } contains attributes).

Therefore, I suggest adding a rule on the regexp (ident | ')') blank* '[' to recognize first-dimension array subscripts, and adding another rule on the regexp ']' blank* '[' that uses the stack s to determine whether ] closes an attribute or an array subscript.
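
To make the idea above concrete, here is a rough sketch in plain OCaml (not ocamllex) that classifies every [ in a string by looking at the previous non-blank character. The names classify_brackets, is_ident_char and the bracket type are made up for illustration. It assumes that blank means space or tab only, and it ignores string literals and comments, so treat it as a starting point rather than a finished rule.

```ocaml
type bracket = Attribute | Subscript

(* Same character set as the poster's [text] regexp, plus digits. *)
let is_ident_char c =
  match c with
  | 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_' | '@' | '$' -> true
  | _ -> false

(* Returns the kind of every '[' in [src], in order of appearance. *)
let classify_brackets (src : string) : bracket list =
  let open_kinds = Stack.create () in  (* kinds of the currently open brackets *)
  let last_closed = ref None in        (* kind closed by the most recent ']' *)
  let prev = ref None in               (* last non-blank character seen *)
  let result = ref [] in
  String.iter
    (fun c ->
      (match c with
       | '[' ->
         let kind =
           match !prev with
           | Some ')' -> Subscript
           (* After ']', repeat whatever that ']' closed. *)
           | Some ']' ->
             (match !last_closed with Some k -> k | None -> Attribute)
           | Some p when is_ident_char p -> Subscript
           | _ -> Attribute
         in
         Stack.push kind open_kinds;
         result := kind :: !result
       | ']' ->
         last_closed :=
           (if Stack.is_empty open_kinds then None
            else Some (Stack.pop open_kinds))
       | _ -> ());
      (* Blanks (space, tab) do not change the "previous character". *)
      if c <> ' ' && c <> '\t' then prev := Some c)
    src;
  List.rev !result
```

With this, classify_brackets "a[1][2]" classifies both brackets as Subscript, while the two brackets in "void MethodA([In][Out] ref double x) { }" come out as Attribute. In the real lexer the same decision would live in the two rules suggested above, with the stack s storing the kind of each open bracket.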

The first thing to do is to look at the other implementations (e.g. Mono). The second is to look at the language specification, which should give you at least EBNF.

For instance, when I wrote a Golang typechecker, I “borrowed” the yacc parser from the golang implementation (these days, they have a hand-written recursive-descent parser, feh) and that was very useful as a starting-point. Maybe Mono has a lexer written using Flex?

Good point. Only looked at the specification atm, just gonna link here for convenience:

Yeah, so already, we can see (looking at the section on attributes) that if we want to do something comprehensive, we’re going to pretty much have to parse a decent subset of C# – the productions for attributes eventually can include expressions, which will be … well, pretty much all of C#, I’m guessing.

At which point, me, I’d try to find a yacc-grammar for C#, and try to rapidly convert that into a menhir or ocamlyacc grammar. Specifically, if I’m trying to -not- understand C# completely, I’d want to get a yacc grammar because that means I can just erase all the actions, put in OCaml actions, and be done. Maybe I can make all the actions be sprintf(), so I can even avoid having to define a buncha types.
Whereas, if I have to concoct a grammar myself, I have to actually understand the bloody language.

Thank you all for these good technical pointers.
For the moment, I have not understood all the things that you have mentioned here.
But this is really interesting.

There’s one more trick you might use. Let’s suppose that your read position is immediately before an attribute. So, about to read the “[”. In that case, and assuming that attributes are parenthesis-matched (more on that in a sec) you could read until the end of the attribute without actually parsing it at all.

So by parenthesis-matched, I mean that every left-paren has a matching right-paren, and similarly for brackets and braces. In which case, you can parse it sort of like s-expressions, without understanding the content. I used to do this for Java source-code, in order to do aspect-oriented code-injection. No need to actually parse everything: just strings and parens/brackets/braces was sufficient. Treat everything else like “significant whitespace”.
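
A minimal sketch of that trick in plain OCaml, in case it helps. The function names skip_balanced and extract_attribute are hypothetical. It assumes the start index points at an opening paren, bracket, or brace, that string literals are ordinary double-quoted strings with backslash escapes (C# verbatim strings and character literals are not handled), and it only counts nesting depth without checking that each closer is of the same kind as its opener.

```ocaml
(* Returns the index just past the closer matching the opener at [start],
   or None if the input ends first. *)
let skip_balanced (src : string) (start : int) : int option =
  let n = String.length src in
  (* [i] is just after an opening quote; returns the index just past
     the closing quote (or a value >= n if the string is unterminated). *)
  let rec skip_string i =
    if i >= n then n
    else
      match src.[i] with
      | '\\' -> skip_string (i + 2)  (* skip the escaped character *)
      | '"' -> i + 1
      | _ -> skip_string (i + 1)
  in
  let rec go i depth =
    if i >= n then None
    else
      match src.[i] with
      | '"' -> go (skip_string (i + 1)) depth
      | '(' | '[' | '{' -> go (i + 1) (depth + 1)
      | ')' | ']' | '}' ->
        if depth = 1 then Some (i + 1) else go (i + 1) (depth - 1)
      | _ -> go (i + 1) depth  (* everything else is "significant whitespace" *)
  in
  go start 0

(* Returns the text from [start] up to and including the matching closer. *)
let extract_attribute (src : string) (start : int) : string option =
  match skip_balanced src start with
  | Some stop -> Some (String.sub src start (stop - start))
  | None -> None
```

Because strings are skipped as a unit, a ] inside a string argument does not end the attribute. This only answers "where does the attribute end"; deciding whether a given [ starts an attribute in the first place is still a separate problem.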