Lexer rule to create multiple tokens from a match

Hi!
I am writing a lexer. Typical strings to tokenize are

"Push 1"
"Push 1 2"

I am parsing a programming language in which Push means pushing to the stack: "Push 1" pushes '1' to the stack, and "Push 1 2" pushes '1' and then '2'.
Obviously, "Push 1 2" is equivalent to the sequence "Push 1" "Push 2".

A simple rule would be

rule token = parse
| "Push " (_ as c) { PUSH c }
| "Push " (_ as c1) " " (_ as c2) { PUSHES (c1, c2) }

It would be much easier to write the parser if I could produce two PUSH tokens from "Push 1 2". Push can even take more than 2 arguments.
Is there a way to achieve this? For example, is there a way to append data to lexbuf in an action? Something along the lines of

rule token = parse
| "Push " (_ as c) { PUSH c }
| "Push " (_ as c1) " " (_ as c2) { add lexbuf ("Push " ^ String.make 1 c2); PUSH c1 }

where add : Lexing.lexbuf -> string -> unit.

Thanks for your help.

Hi,

Why not recognize INT tokens, then define some value nonterminal which you could use like this?

instr: PUSH l=value+ NL { Ast.Push l }
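
For instance, the lexer could emit plain INT tokens and value would just wrap them. A minimal sketch (the exact token names here are assumptions, not your actual grammar):

rule token = parse
| ' ' { token lexbuf }
| "Push" { PUSH }
| ['0'-'9']+ as n { INT (int_of_string n) }
| '\n' { NL }

with, on the grammar side:

value: i = INT { i }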

There’s no documented function which allows you to insert data in the underlying buffer (lex_buffer).

That is actually what I am doing now.
But I would like to skip building the Ast: I want the parser itself to behave as the compiler.
For example, I want to write

main: x = PUSH expr y = PUSH SET_VARIABLE { (* set in the context x = y *) }

This way I can process the stack directly, using the power of the parser. There is no need to build the Ast and then process it with standard OCaml code.
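
Concretely, the semantic actions would mutate an interpreter state instead of returning Ast nodes. A sketch with invented names (ctx is a hypothetical mutable environment that would live in the .mly header; variables are assumed to be integer slots):

let ctx : (int, int) Hashtbl.t = Hashtbl.create 16

main: x = PUSH y = PUSH SET_VARIABLE { Hashtbl.replace ctx x y }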

Actually, we could insert data in the buffer by doing lexbuf.lex_buffer <- Bytes.of_char_list data, but then we would have to deal with the positioning.

OK! I was able to insert data in the buffer with

  (* Splice [data] into the lexing buffer at the current position.
     Note: String.to_list, Bytes.to_list, List.split_n and
     Bytes.of_char_list come from Core/Base, not from Stdlib. *)
  let insert_data_curr_pos (lexbuf : Lexing.lexbuf) (data : string) : unit =
    let data = String.to_list data in
    let lex_curr_pos = lexbuf.lex_curr_pos in
    let old_data = Bytes.to_list lexbuf.lex_buffer in
    (* split the buffer at the current position ... *)
    let prefix, suffix = List.split_n old_data lex_curr_pos in
    (* ... and reassemble it with [data] spliced in between *)
    let new_data = prefix @ data @ suffix in
    let new_data = Bytes.of_char_list new_data in
    lexbuf.lex_buffer <- new_data;
    lexbuf.lex_buffer_len <- Bytes.length new_data

The rule is now

rule token = parse
| "Push " (_ as c) { PUSH c }
| "Push " (_ as c1) " " (_ as c2)
    { insert_data_curr_pos lexbuf ("Push " ^ String.of_char c2); PUSH c1 }

It works! I am not sure how efficient it is, though: each insertion converts the entire buffer to a list and back.
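
If that copying ever becomes a problem, the same splice can be written with Bytes.blit and no intermediate lists. A stdlib-only sketch of the same function (not benchmarked):

  let insert_data_curr_pos (lexbuf : Lexing.lexbuf) (data : string) : unit =
    let pos = lexbuf.lex_curr_pos in
    let len = lexbuf.lex_buffer_len in
    let extra = String.length data in
    let new_buf = Bytes.create (len + extra) in
    (* copy [0, pos), then [data], then [pos, len) *)
    Bytes.blit lexbuf.lex_buffer 0 new_buf 0 pos;
    Bytes.blit_string data 0 new_buf pos extra;
    Bytes.blit lexbuf.lex_buffer pos new_buf (pos + extra) (len - pos);
    lexbuf.lex_buffer <- new_buf;
    lexbuf.lex_buffer_len <- len + extra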

The type of the generated lexer is typically

val token : Lexing.lexbuf -> <your result type here>
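
and the parser entry point that ocamlyacc or menhir generates accepts any function of that shape:

val main : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> <semantic value>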

So another way of doing this is to write a wrapper function that transduces the stream of tokens coming from the lexer. This is (was) one of the use-cases for stream-parsers, and I’d assume it’s straightforward to do with Seq, but you can do it by hand pretty easily, too.

let token =
  let ctr = ref 0 in
  fun (_lb : Lexing.lexbuf) ->
    (* a dummy "lexer" that alternately returns two tokens and one token *)
    let v = !ctr in
    incr ctr;
    if !ctr mod 2 = 0 then [v] else [v; v + 1]
;;
let lb = Lexing.from_string "";;
let wrap lexer =
  let buf = ref None in
  fun lb ->
    match !buf with
    | Some v ->
      (* a token is pending from the previous call: emit it first *)
      buf := None; v
    | None ->
      match lexer lb with
      | [v] -> v
      | [v1; v2] -> buf := Some v2; v1
      | _ -> assert false (* the dummy lexer only returns 1 or 2 tokens *)
;;
let token2 = wrap token ;;

If you invoke token repeatedly, you get a stream like

# token lb;;
- : int list = [0; 1]
# token lb;;
- : int list = [1]
# token lb;;
- : int list = [2; 3]
# token lb;;
- : int list = [3]
# token lb;;
- : int list = [4; 5]
# token lb;;
- : int list = [5]
# token lb;;
- : int list = [6; 7]
# token lb;;
- : int list = [7]

and if you then invoke token2 (on the same lb, so the counter continues from where the calls above left off), you get

# token2 lb ;;
- : int = 8
# token2 lb ;;
- : int = 9
# token2 lb ;;
- : int = 9
# token2 lb ;;
- : int = 10
# token2 lb ;;
- : int = 11
# token2 lb ;;
- : int = 11
# token2 lb ;;
- : int = 12

You don’t have to build a concrete AST: you can do whatever you want with l :wink:
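
Since Push can take any number of arguments, the one-slot buffer also generalizes to a queue, so the inner lexer may return lists of any non-zero length. A stdlib-only sketch (wrap_many is an invented name):

let wrap_many lexer =
  let pending = Queue.create () in
  fun lb ->
    (* refill from the real lexer whenever the queue runs dry;
       assumes the lexer never returns an empty list *)
    if Queue.is_empty pending then
      List.iter (fun t -> Queue.add t pending) (lexer lb);
    Queue.pop pending
;;
let token3 = wrap_many token ;;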