A lexer using the Seq module

I am writing a lexer as part of a project to learn OCaml. Thanks to @octachron’s advice at https://discuss.ocaml.org/t/read-a-file-line-by-line-and-print-line-numbers/8846/29, I looked at the Seq module and defined a lexer of the form

val lexer : char automaton -> in_channel -> token Seq.t = <fun>

which I’ll then pass to a parser. Here is the (compiling) code. Is this the way to go, as far as the choice of modules (Seq, Buffer) and the general architecture are concerned? (Any other advice, or pointers at newbie mistakes, is welcome.)

let is_empty t = Buffer.length t = 0 (* why is it not in Stdlib/Buffer ? *)

exception IllegalToken of string

let charseq_of_inchan =
  let read_char ic = Option.map (fun c -> (c, ic)) (In_channel.input_char ic) in
  Seq.unfold read_char

type 'a automaton = {
  transmap : int -> 'a -> int;
  initial : int;
  final : int list;
  sink : int;
  skippable : 'a list; (* typically, whitespaces *)
}

type token = string (* TODO later: make it a record and add position *)

let tokenseq_of_charseq aut =
  let read_tok seq =
    let rec aux buf state seq =
      (* buf contains the beginning of the current lexeme *)
      match Seq.uncons seq with
      | None -> None
      | Some (c, seq') -> (
          if List.mem c aut.skippable then
            if is_empty buf then aux buf aut.initial seq'
            else Some (Buffer.contents buf, seq')
          else
            match aut.transmap state c with
            | state' when state' = aut.sink ->
                raise (IllegalToken (Buffer.contents buf))
            | state' when List.mem state' aut.final ->
                Some (Buffer.contents buf, seq')
            | state' ->
                Buffer.add_char buf c;
                aux buf state' seq')
    in
    aux (Buffer.create 16) aut.initial seq
  in
  Seq.unfold read_tok

let lexer aut ic = (tokenseq_of_charseq aut) (charseq_of_inchan ic)
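
For illustration, here is how I expect to consume it (my_automaton and input.txt are only placeholders):

let () =
  In_channel.with_open_text "input.txt" (fun ic ->
      (* the token sequence reads lazily from ic, so it must be consumed
         before the channel is closed *)
      lexer my_automaton ic |> Seq.iter print_endline)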

A few random remarks:

  • You don’t return the state of the automaton in your token type. This probably isn’t very important for now, as your states do not carry much information, but typically a lexer’s automaton will recognise the kind of token by the state in which it ends so I assume that you’ll want to extract that later.
  • I’m personally not fond of Seq, so I won’t have much to say on it; but I think at least that your mechanism for generating a stream of characters from an input channel is way too complicated. A simple Seq.of_dispenser (fun () -> In_channel.input_char ic) would work (see the sketch after this list).
  • Buffer.is_empty isn’t exposed because you almost never need it. That includes your own usage: you’re using the emptiness of the buffer as a proxy for knowing whether you have encountered any non-skippable characters, and you have a lot of other straightforward ways to do that. My favourite would be to encode it in the control flow, with a first loop that skips all skippable characters, and then a second loop that handles characters until a skippable one is reached. If you don’t want to change your code too much you can simply add a boolean parameter to aux (before_first_transition or something like that) and use it instead of Buffer.is_empty.
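
For instance, a minimal sketch of that simplification (assuming OCaml ≥ 4.14, where In_channel and Seq.of_dispenser are available):

(* the whole of charseq_of_inchan collapses to a single call *)
let charseq_of_inchan ic =
  Seq.of_dispenser (fun () -> In_channel.input_char ic)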

Regarding the general approach, I don’t have much to say. It is one of several possible approaches, each with its own pros and cons. I personally find it wasteful to reconstruct strings character by character when you could have used String.sub from the initial input, but if you want your lexer to work with arbitrary character streams then your approach works better. (I also tend to be much more concerned about performance than I should be, so all this wrapping around in Seqs is making me uncomfortable.)

Thanks for these very useful remarks.

Good point. I guess there is a tradeoff with how complex you allow your automaton to be? I’ll have to think more about it.

Thanks for the information. I was advised against Stream last year (before the deprecation notice), against Str a few days ago here, and now against Seq… So we cannot even be confident about things in the Stdlib? This kind of information is hard to obtain for a self-learner. Fortunately this forum exists.

I didn’t know Seq.of_dispenser. Your proposal seems to be semantically equivalent to, and much simpler than, my function.

Actually, when a skippable is met: if my buffer is non-empty, I “flush it out”; if it is empty (meaning it was flushed at the previous step and I’m typically reading a second, third… skippable), I continue skipping. But your idea of encoding this in the control flow does seem preferable.

I don’t mind changing my code at this early stage: that was the point of asking pretty early about the architecture decisions, so that refactoring now is not costly.

I found my signature char automaton -> char Seq.t -> token Seq.t pretty neat :frowning:
Actually, I should be concerned about performance too, since I want to parse files that are tens of megabytes long reasonably fast (though I would favor simpler code even if slightly slower). Could you give a hint about the method you began to describe? I tried an approach with In_channel.input_line and then lexing a string, but some tokens might span several lines and it seemed a bit complex. (Actually, only the skippables might span several lines, typically multiline comments, which may contain characters that are forbidden outside comments.)

If there are commendable blog posts or tutorials on these matters, they are also welcome.

In this case, this is more a personal view than general good practice. The Seq module is a fairly recent addition, well documented and maintained. If you want to use sequences, it is better to use it than alternatives like Stream. I just don’t think sequences are a useful concept in general, at least in the kind of programs I write.
As an example, while the result type (token Seq.t) looks reasonable, the input char Seq.t is silly: something from which you can read characters is more or less equivalent to an In_channel.t, so you might as well use that.

If you’re only concerned about inputs smaller than a gigabyte, I would start by reading the input into a single string (using In_channel.input_all), and depending on what you want to do next either extract a substring for each token (leaving the GC to reclaim the big string once all input has been handled) or keep the string alive during the whole of your program and manipulate indices (or pairs of indices) instead of substrings.
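
As a rough, untested sketch of that idea (input.txt is a placeholder), the two variants would look like:

(* Read the whole file once. *)
let input = In_channel.(with_open_text "input.txt" input_all)

(* Variant 1: materialise each lexeme; once lexing is done, nothing keeps
   the big string alive and the GC can reclaim it. *)
let lexeme_at start stop = String.sub input start (stop - start)

(* Variant 2: keep only indices; no copying, but the big string must stay
   alive for as long as the spans are used. *)
type span = { start : int; stop : int }
let lexeme_of { start; stop } = String.sub input start (stop - start)
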
For bigger inputs, or if you don’t like this idea for other reasons, you should still have a mechanism for reading characters by chunks rather than one by one. If you want to continue experimenting with sequences, you could have a first function that builds a chunk Seq.t out of an input channel (there are various options for what a chunk would be, but for simplicity this could be a string computed using In_channel.input_line, with more efficient versions using a fixed-size buffer and In_channel.input). Then you could build a char Seq.t out of this sequence of chunks (for example using String.to_seq and Seq.concat). With the right compiler settings, you might end up with a reasonably efficient result.
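
A minimal sketch of that chunked variant, using In_channel.input_line for simplicity (note that input_line drops the newline characters, so a real lexer would have to account for that):

(* One chunk per line. *)
let chunk_seq ic =
  Seq.of_dispenser (fun () -> In_channel.input_line ic)

(* Flatten the chunks back into a character sequence. *)
let char_seq ic = Seq.concat_map String.to_seq (chunk_seq ic)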

But my experience is that it’s quite hard to predict what performance will be like if you have very fine-grained sequences. So I would advise to use a more compact data structure for chunks of characters (typically String, Bytes, or Bigstring), and use low-level access as long as you’re in the same chunk. The mechanism for getting a new chunk doesn’t have to be particularly fast, so you can still use chunk Seq.t for that if you want.
As a concrete example, you could have:

type chunk = {
  contents : string;
  length : int;
  offset : int;
}

type state = {
  current_chunk : chunk;
  chunk_sequence : chunk Seq.t;
}

(* Fetch chunks until one with characters left is found. *)
let rec refresh_state chunk_sequence =
  match chunk_sequence () with
  | Seq.Nil -> None
  | Seq.Cons (current_chunk, chunk_sequence) ->
    if current_chunk.offset < current_chunk.length
    then Some { current_chunk; chunk_sequence }
    else refresh_state chunk_sequence

(* Read one character, pulling in a new chunk when the current one is exhausted. *)
let rec get_char state =
  if state.current_chunk.offset >= state.current_chunk.length
  then match refresh_state state.chunk_sequence with
    | None -> None
    | Some state -> get_char state
  else
    let c = state.current_chunk.contents.[state.current_chunk.offset] in
    let current_chunk =
      { state.current_chunk with offset = state.current_chunk.offset + 1 } in
    Some (c, { state with current_chunk })

You can get something slightly more efficient if you use mutable chunks (no need to reallocate the state except when fetching a new chunk).
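
A sketch of what that mutable variant could look like (names are made up):

type mstate = {
  mutable contents : string;    (* current chunk *)
  mutable length : int;
  mutable offset : int;
  mutable rest : string Seq.t;  (* remaining chunks *)
}

let rec next_char st =
  if st.offset < st.length then begin
    let c = st.contents.[st.offset] in
    st.offset <- st.offset + 1;  (* updated in place: no allocation *)
    Some c
  end
  else
    match st.rest () with
    | Seq.Nil -> None
    | Seq.Cons (chunk, rest) ->
      st.contents <- chunk;
      st.length <- String.length chunk;
      st.offset <- 0;
      st.rest <- rest;
      next_char st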

Another thought: it won’t work on pipes or sockets, but yet another option is using Unix.map_file to map the input file to a Bigarray, and iterating over that instead.
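
Something along these lines (untested sketch; it needs the unix library, and the file name is a placeholder):

(* Map the whole file into a char Bigarray; [| -1 |] lets map_file infer
   the size from the file, and shared = false since we only read. *)
let map_input filename =
  let fd = Unix.openfile filename [ Unix.O_RDONLY ] 0 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
      Unix.map_file fd Bigarray.char Bigarray.c_layout false [| -1 |]
      |> Bigarray.array1_of_genarray)

(* Iterate over the mapped bytes. *)
let iter_chars f a =
  for i = 0 to Bigarray.Array1.dim a - 1 do
    f (Bigarray.Array1.get a i)
  done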

I was under the impression that an “in_channel” performed its own buffering. As long as the source of data is an in_channel, adding a “chunking” layer on top would be redundant, no?

Also, this is just me, but for splitting up a stream of bytes into tokens, the first thing that comes to mind for any remotely complex tokenization rules is ocamllex. It may seem like a dusty, intimidating tool for compiler wizards, but if you already have a grasp of regular expressions it is truly very easy to use. And just because you’re using ocamllex does not mean you have to use a parser generator like ocamlyacc or menhir; you can still get your token sequence with something like (untested):

(* assumes you've written lexer.mll with rule 'read' *)
let token_seq channel =
  let buf = Lexing.from_channel channel in
  let next_token () =
    match Lexer.read buf with
    | exception End_of_file -> None
    | token -> Some token
  in
  Seq.of_dispenser next_token
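
For reference, a minimal (equally untested) lexer.mll that would fit the code above could look like:

{ type token = Int of int | Ident of string }

rule read = parse
  | [' ' '\t' '\n']+              { read lexbuf }            (* skip whitespace *)
  | ['0'-'9']+ as n               { Int (int_of_string n) }
  | ['a'-'z' 'A'-'Z' '_']+ as id  { Ident id }
  | eof                           { raise End_of_file }
  | _                             { failwith "unexpected character" }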

Not completely. The buffer for in_channel is on the C side, so you still get a C call for each character if you use input_char. Using an intermediate buffer will keep the number of C calls down.
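
For example, something along these lines reads the channel 64 KiB at a time, so you get roughly one C call per chunk rather than one per character (untested sketch):

let chunks_of_channel ic =
  let buf = Bytes.create 65536 in
  let next () =
    match In_channel.input ic buf 0 (Bytes.length buf) with
    | 0 -> None                           (* end of file *)
    | n -> Some (Bytes.sub_string buf 0 n)
  in
  Seq.of_dispenser next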

Having a simple model is nice, and I personally think you should never add complexity unless absolutely necessary performance-wise.

I am also curious why OP doesn’t want to use ocamllex…

I’m not sure myself! I thought I would then have to use Menhir, but @droyo gave a way out. I am a bit wary of depending on an external code generation tool, but maybe this is not justified? On the other hand, would using ocamllex force me to write the syntax of my language in a standardized format? Not sure how to weigh the pros and cons.

In any case, my lexer now looks like:

type position = { mutable curr : int; mutable bol : int; mutable line : int }
type token = { lexeme : lexeme; position : position }

let next_token str pos = [... ~100 lines ...]

let lexer str = Seq.unfold (next_token str) { curr = 0; bol = 0; line = 1 }

let filecontent filename = In_channel.(with_open_text filename input_all)

let test_lexer filename =
  filename |> filecontent |> lexer
  |> Seq.iter (fun x -> print_endline (token_to_string x));;

val next_token : string -> position -> (token * position) option = <fun>
val lexer : string -> token Seq.t = <fun>
val filecontent : string -> string = <fun>
val test_lexer : string -> unit = <fun>

Oh, I figured you were doing it as part of “learning”. Which … isn’t a bad thing. The first big project in my first big programming class (which consumed every waking hour of my semester, and, with the help of no-doze, many of what should have been sleep hours) was a “tokenizer”, which was basically a hand-written lexer for a typical programming language. I think ocamllex (sedlex, genlex, whatever) is a perfectly fine tool. Much better than writing one by hand. But having written one by hand for a moderately-complex token specification is … educational. Once.

Actually, it’s half and half. For now, I have a working lexer with correct error-reporting and position-tracking, so I’ll work on other parts (the parser, building an AST using “Should I use objects in this case?”, and an interface using “Parsing one’s own command line”). Then I might come back and use ocamllex.