Hi!
I am writing a lexer. Typical strings to tokenize are "Push 1" and "Push 1 2". I am parsing a programming language in which Push means pushing onto the stack: "Push 1" pushes '1' onto the stack, and "Push 1 2" pushes '1' and then '2'. Obviously, "Push 1 2" is equivalent to the sequence "Push 1" "Push 2".
A simple rule would be
rule token = parse
| "Push " _ { PUSH $2 }
| "Push " _ _ { PUSHES ($2, $3) }
It would be much easier to write the parser if I could produce two PUSH tokens from "Push 1 2". Push can even take more than 2 arguments.
Is there a way to achieve this? Is there a way to append data to the lexbuf from an action, for example? Something along the lines of
rule token = parse
| "Push " _ { PUSH $2 }
| "Push " _ _ { add lexbuf "Push $3"; PUSH $2 }
where add : Lexing.lexbuf -> string -> unit.
Thanks for your help.
Hi,
Why not recognize INT tokens, then define some value nonterminal which you could use like this?
instr: PUSH l=value+ NL { Ast.Push l }
There’s no documented function which allows you to insert data in the underlying buffer (lex_buffer).
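To make the suggestion concrete, here is a minimal plain-OCaml sketch of the token stream that approach produces, outside ocamllex/Menhir. The token constructors (PUSH, INT, NL) are hypothetical stand-ins for what the generated parser would declare:

```ocaml
(* Hypothetical token type for the sketch; the real one would be declared
   in the Menhir grammar. *)
type token = PUSH | INT of int | NL

(* Tokenize one "Push n1 n2 ..." line into PUSH followed by one INT per
   argument, which is exactly what a [value+] rule would consume. *)
let tokenize (line : string) : token list =
  match String.split_on_char ' ' line with
  | "Push" :: args ->
      (PUSH :: List.map (fun a -> INT (int_of_string a)) args) @ [NL]
  | _ -> failwith "unrecognized instruction"
```

With this shape, "Push 1 2" lexes to PUSH, INT 1, INT 2, NL, and the parser's value+ collects the list without any token splitting.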
It is actually what I am doing now.
But I would like to skip building the Ast: I want the parser itself to behave as the compiler.
For example I want to write
main: x = PUSH expr y = PUSH SET_VARIABLE { (* set in the context x = y *) }
This way I can process the stack directly, using the power of the parser: no need to build the Ast and then process it with ordinary OCaml code.
Actually, we could insert data in the buffer by doing lexbuf.lex_buffer <- Bytes.of_char_list data, but then we have to deal with the positioning.
OK! I was able to insert data in the buffer with
(* Uses Core's String.to_list, List.split_n and Bytes.of_char_list. *)
let insert_data_curr_pos (lexbuf : Lexing.lexbuf) (data : string) : unit =
  let data = String.to_list data in
  let lex_curr_pos = lexbuf.lex_curr_pos in
  let old_data = Bytes.to_list lexbuf.lex_buffer in
  let prefix, suffix = List.split_n old_data lex_curr_pos in
  let new_data = prefix @ data @ suffix in
  let new_data = Bytes.of_char_list new_data in
  lexbuf.lex_buffer <- new_data;
  lexbuf.lex_buffer_len <- Bytes.length new_data
The rule is now
rule token = parse
| "Push " _ { PUSH $2 }
| "Push " _ _ { insert_data_curr_pos lexbuf ("Push " ^ (String.of_char $3)); PUSH $2 }
It works! I am not sure how efficient it is, though.
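For reference, here is a sketch of the same insertion that avoids the list round-trips by blitting bytes directly. It is stdlib-only (no Core) and relies only on the documented mutable lexbuf fields (lex_buffer, lex_buffer_len, lex_curr_pos); the name insert_data_curr_pos' is just for this sketch:

```ocaml
(* A variant of insert_data_curr_pos that blits bytes instead of converting
   the buffer to a list and back. *)
let insert_data_curr_pos' (lexbuf : Lexing.lexbuf) (data : string) : unit =
  let open Lexing in
  let n = String.length data in
  let pos = lexbuf.lex_curr_pos in
  let old_len = lexbuf.lex_buffer_len in
  let buf = Bytes.create (old_len + n) in
  Bytes.blit lexbuf.lex_buffer 0 buf 0 pos;                        (* already consumed *)
  Bytes.blit_string data 0 buf pos n;                              (* inserted data *)
  Bytes.blit lexbuf.lex_buffer pos buf (pos + n) (old_len - pos);  (* the rest *)
  lexbuf.lex_buffer <- buf;
  lexbuf.lex_buffer_len <- old_len + n
```

This is O(buffer length) per insertion like the list version, but without allocating a cons cell per byte. It still leaves the positioning caveat mentioned above untouched.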
The type of the generated lexer is typically
val token : Lexing.lexbuf -> <your result type here>
So another way of doing this is to write a wrapper function that transduces the stream of tokens from the lexer. This is (was) one of the use-cases for stream-parsers, and I’d assume that it’s straightforward to do with Seq, but you can do it by hand pretty easily, too.
let token =
  let ctr = ref 0 in
  fun (lb : Lexing.lexbuf) ->
    let v = !ctr in
    incr ctr;
    if !ctr mod 2 = 0 then [v] else [v; v + 1]
;;
let lb = Lexing.from_string "";;
let wrap lexer =
  let buf = ref None in
  fun lb ->
    match !buf with
    | Some v -> buf := None; v
    | None ->
        match lexer lb with
        | [v] -> v
        | [v1; v2] -> buf := Some v2; v1
;;
let token2 = wrap token ;;
If you invoke token repeatedly, you get a stream like
# token lb;;
- : int list = [0; 1]
# token lb;;
- : int list = [1]
# token lb;;
- : int list = [2; 3]
# token lb;;
- : int list = [3]
# token lb;;
- : int list = [4; 5]
# token lb;;
- : int list = [5]
# token lb;;
- : int list = [6; 7]
# token lb;;
- : int list = [7]
and if you invoke token2, you get
# token2 lb ;;
- : int = 8
# token2 lb ;;
- : int = 9
# token2 lb ;;
- : int = 9
# token2 lb ;;
- : int = 10
# token2 lb ;;
- : int = 11
# token2 lb ;;
- : int = 11
# token2 lb ;;
- : int = 12
You don’t have to build a concrete AST; you can do what you want with l.
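Applied to the original question, the same wrapper idea can split a two-argument push back into two PUSH tokens. This is only a sketch: the token type and the split_pushes name are hypothetical, not part of any library:

```ocaml
(* Hypothetical token type for the sketch. *)
type tok = PUSH of int | PUSHES of int * int | EOF

(* Wrap a lexer so that PUSHES (a, b) is delivered as PUSH a now and
   PUSH b on the next call; every other token passes through unchanged. *)
let split_pushes lexer =
  let pending = ref None in
  fun lb ->
    match !pending with
    | Some t -> pending := None; t
    | None ->
        (match lexer lb with
         | PUSHES (a, b) -> pending := Some (PUSH b); PUSH a
         | t -> t)
```

The parser then only ever sees PUSH tokens, so the grammar stays uniform regardless of how many arguments the surface syntax packs into one Push instruction.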