Understanding lexbuf


Please consider the Java annotation below:


I am trying to capture this annotation using the lexer rule below:

(* digit and eol are assumed to be defined elsewhere in the lexer *)
let identifier_nondigit = ['A'-'Z' 'a'-'z' '_' '@' '$']
let identifier = identifier_nondigit (identifier_nondigit | digit)*

rule next_token = parse
| '@' identifier '(' as id {
    let annot = parse_and_include_annotation (Buffer.create 32) lexbuf in
    print_string ("\nannot:" ^ id ^ annot);
    print_string ("\nLexing.lexeme lexbuf:" ^ Lexing.lexeme lexbuf)
    (* Do some operation on lexbuf *)
  }

and parse_and_include_annotation buff = parse
| ')' eol {
    Buffer.add_char buff ')';
    Buffer.contents buff
  }
| _ {
    Buffer.add_string buff (Lexing.lexeme lexbuf);
    parse_and_include_annotation buff lexbuf
  }

The output of the print statements in the above rule is:

Lexing.lexeme lexbuf:)

As you can see, Lexing.lexeme lexbuf gives only the last character of the annotation, i.e. ‘)’.
I expected it to print the full annotation string, as the first print statement does. I want to perform further operations on this lexbuf.
But I do not understand this behavior of lexbuf. Could you please tell me what is wrong with my code?

Thank you in advance.

lexbuf is modified in place by the lexer. After the call to parse_and_include_annotation returns, Lexing.lexeme lexbuf returns the last lexeme matched by the lexer, which is the closing “)” (together with the end of line matched by eol), not the text matched by next_token.

OK, but is it possible to modify the lexbuf structure? I mean, could I modify lexbuf.Lexing.lex_curr_p somehow so that the lexeme output includes the full annotation text?

As a general remark, the lexbuf structure is not meant to be modified by the user – the lexing engine assumes it is the only one modifying it. You already have access to the full annotation text (annot), and I don’t see what would be gained by extracting this again from the lexbuf.

Maybe it would help the discussion if you explained what you are really trying to achieve.

Using lexbuf, I want to create a Token with the information below:

	Token {
		token_text = Lexing.lexeme lexbuf;
		token_line = lexbuf.Lexing.lex_curr_p.Lexing.pos_lnum;
		token_start = Lexing.lexeme_start lexbuf;
		token_end = Lexing.lexeme_end lexbuf
	}

I don’t need only the annotation text.
As you can see, I also need the token_start and token_end values.

I need these values to cover the full annotation text.
I hope this makes my requirement clear.

Thank you again.

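Well, you would need to record the locations (using Lexing.lexeme_start_p and Lexing.lexeme_end_p) when you start and when you finish extracting the annotation text: the start position in next_token before calling parse_and_include_annotation, and the end position once that rule has matched the closing “)”. Then propagate these two locations back to the calling function.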
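Here is a minimal sketch of how the token could then be assembled. The record type `token` and the helper `make_token` are hypothetical names that only mirror the fields from the question. The idea is that the next_token action reads `Lexing.lexeme_start_p lexbuf` before calling parse_and_include_annotation, reads `Lexing.lexeme_end_p lexbuf` after it returns, and passes both positions to the helper together with `id ^ annot`. Taking the line number from the start position is an assumption; the question used lex_curr_p instead.

```ocaml
(* Hypothetical token record mirroring the fields in the question. *)
type token = {
  token_text : string;
  token_line : int;
  token_start : int;
  token_end : int;
}

(* Build a token from the full annotation text and the two positions
   captured around the call to the sub-rule. The offsets are absolute
   character offsets (pos_cnum), which is what Lexing.lexeme_start and
   Lexing.lexeme_end return for a single lexeme. *)
let make_token text (start_p : Lexing.position) (end_p : Lexing.position) =
  { token_text = text;
    (* Assumption: report the line on which the annotation starts. *)
    token_line = start_p.Lexing.pos_lnum;
    token_start = start_p.Lexing.pos_cnum;
    token_end = end_p.Lexing.pos_cnum }
```

Note that pos_lnum is only meaningful if the lexer's actions call Lexing.new_line on each end of line; the lexing engine itself only keeps pos_cnum up to date.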

But really, it would be much simpler if you used a parser separate from the lexer for this. Both parser generators, ocamlyacc and menhir, offer a simple way to keep track of token locations.


Thank you Nicolás. It is working.
But as you suggested, I will try parser too.
Thank you again. :slight_smile: