Lexer rule to create multiple tokens from a match

Hi!
I am writing a lexer. Typical strings to tokenize are

"Push 1"
"Push 1 2"

I am parsing a programming language in which Push means pushing to the stack: "Push 1" pushes '1' to the stack, and "Push 1 2" pushes '1' and then '2'.
Obviously, "Push 1 2" is equivalent to the sequence "Push 1" "Push 2".

A simple rule would be

rule token = parse
| "Push " (_ as c) { PUSH c }
| "Push " (_ as c1) " " (_ as c2) { PUSHES (c1, c2) }

It would be much easier to write the parser if I could produce two PUSH tokens from "Push 1 2". Push can even take more than 2 arguments.
Is there a way to achieve this? For example, is there a way to append data to lexbuf in an action? Something along the lines of

rule token = parse
| "Push " (_ as c) { PUSH c }
| "Push " (_ as c1) " " (_ as c2) { add lexbuf ("Push " ^ String.make 1 c2); PUSH c1 }

where add : Lexing.lexbuf -> string -> unit.

Thanks for your help.

Hi,

Why not recognize INT tokens, then define some value nonterminal which you could use like this?

instr: PUSH l=value+ NL { Ast.Push l }
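
For instance, the lexer could emit plain INT tokens and value would just wrap them. A minimal sketch (the exact token names here are assumptions, not your actual grammar):

rule token = parse
| ' ' { token lexbuf }
| "Push" { PUSH }
| ['0'-'9']+ as n { INT (int_of_string n) }
| '\n' { NL }

with, on the grammar side:

value: i = INT { i }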

There’s no documented function which allows you to insert data in the underlying buffer (lex_buffer).

That is actually what I am doing now.
But I would like to skip building the Ast: I want the parser itself to behave as the compiler.
For example, I want to write

main: x = PUSH expr y = PUSH SET_VARIABLE { (* set in the context x = y *) }

This way I can process the stack directly, using the power of the parser. There is no need to build the Ast and then process it with standard OCaml code.
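
Concretely, the semantic actions would mutate an interpreter state instead of returning Ast nodes. A sketch with invented names (ctx is a hypothetical mutable environment that would live in the .mly header; variables are assumed to be integer slots):

let ctx : (int, int) Hashtbl.t = Hashtbl.create 16

main: x = PUSH y = PUSH SET_VARIABLE { Hashtbl.replace ctx x y }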

Actually, we could insert data in the buffer by doing lexbuf.lex_buffer <- Bytes.of_char_list data, but then we would have to deal with the positioning.

OK! I was able to insert data in the buffer with

  (* Splice [data] into the lexing buffer at the current position.
     Note: String.to_list, Bytes.to_list, List.split_n and
     Bytes.of_char_list come from Core/Base, not from Stdlib. *)
  let insert_data_curr_pos (lexbuf : Lexing.lexbuf) (data : string) : unit =
    let data = String.to_list data in
    let lex_curr_pos = lexbuf.lex_curr_pos in
    let old_data = Bytes.to_list lexbuf.lex_buffer in
    (* split the buffer at the current position ... *)
    let prefix, suffix = List.split_n old_data lex_curr_pos in
    (* ... and reassemble it with [data] spliced in between *)
    let new_data = prefix @ data @ suffix in
    let new_data = Bytes.of_char_list new_data in
    lexbuf.lex_buffer <- new_data;
    lexbuf.lex_buffer_len <- Bytes.length new_data

The rule is now

rule token = parse
| "Push " (_ as c) { PUSH c }
| "Push " (_ as c1) " " (_ as c2)
    { insert_data_curr_pos lexbuf ("Push " ^ String.of_char c2); PUSH c1 }

It works! I am not sure how efficient it is, though: each insertion converts the entire buffer to a list and back.
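
If that copying ever becomes a problem, the same splice can be written with Bytes.blit and no intermediate lists. A stdlib-only sketch of the same function (not benchmarked):

  let insert_data_curr_pos (lexbuf : Lexing.lexbuf) (data : string) : unit =
    let pos = lexbuf.lex_curr_pos in
    let len = lexbuf.lex_buffer_len in
    let extra = String.length data in
    let new_buf = Bytes.create (len + extra) in
    (* copy [0, pos), then [data], then [pos, len) *)
    Bytes.blit lexbuf.lex_buffer 0 new_buf 0 pos;
    Bytes.blit_string data 0 new_buf pos extra;
    Bytes.blit lexbuf.lex_buffer pos new_buf (pos + extra) (len - pos);
    lexbuf.lex_buffer <- new_buf;
    lexbuf.lex_buffer_len <- len + extra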

The type of the generated lexer is typically

val token : Lexing.lexbuf -> <your result type here>
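
and the parser entry point that ocamlyacc or menhir generates accepts any function of that shape:

val main : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> <semantic value>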

So another way of doing this is to write a wrapper function that transduces the stream of tokens coming from the lexer. This is (was) one of the use-cases for stream-parsers, and I’d assume it’s straightforward to do with Seq, but you can do it by hand pretty easily, too.

let token =
  let ctr = ref 0 in
  fun (_lb : Lexing.lexbuf) ->
    (* a dummy "lexer" that alternately returns two tokens and one token *)
    let v = !ctr in
    incr ctr;
    if !ctr mod 2 = 0 then [v] else [v; v + 1]
;;
let lb = Lexing.from_string "";;
let wrap lexer =
  let buf = ref None in
  fun lb ->
    match !buf with
    | Some v ->
      (* a token is pending from the previous call: emit it first *)
      buf := None; v
    | None ->
      match lexer lb with
      | [v] -> v
      | [v1; v2] -> buf := Some v2; v1
      | _ -> assert false (* the dummy lexer only returns 1 or 2 tokens *)
;;
let token2 = wrap token ;;

If you invoke token repeatedly, you get a stream like

# token lb;;
- : int list = [0; 1]
# token lb;;
- : int list = [1]
# token lb;;
- : int list = [2; 3]
# token lb;;
- : int list = [3]
# token lb;;
- : int list = [4; 5]
# token lb;;
- : int list = [5]
# token lb;;
- : int list = [6; 7]
# token lb;;
- : int list = [7]

and if you then invoke token2 (on the same lb, so the counter continues from where the calls above left off), you get

# token2 lb ;;
- : int = 8
# token2 lb ;;
- : int = 9
# token2 lb ;;
- : int = 9
# token2 lb ;;
- : int = 10
# token2 lb ;;
- : int = 11
# token2 lb ;;
- : int = 11
# token2 lb ;;
- : int = 12

You don’t have to build a concrete AST: you can do whatever you want with l :wink:
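
Since Push can take any number of arguments, the one-slot buffer also generalizes to a queue, so the inner lexer may return lists of any non-zero length. A stdlib-only sketch (wrap_many is an invented name):

let wrap_many lexer =
  let pending = Queue.create () in
  fun lb ->
    (* refill from the real lexer whenever the queue runs dry;
       assumes the lexer never returns an empty list *)
    if Queue.is_empty pending then
      List.iter (fun t -> Queue.add t pending) (lexer lb);
    Queue.pop pending
;;
let token3 = wrap_many token ;;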