Which data type to use as an input channel for a parser?

I have an API for which I would like to add a function that parses a specific data structure from its text representation.

I currently process the input line by line, but that is an implementation detail that might change in the future, so ideally I wouldn’t want to reflect it in the interface.

What data type should I use for the “input channel”? I see the following options (using stdlib) so far:

  • Scanf.Scanning.in_channel
  • In_channel.t
  • string

All variants above seem to come with disadvantages:

  • Reading on a per-line basis seems to be somewhat troublesome with Scanf.Scanning.in_channel. I can use Scanf.bscanf input "%[^\n]\n" Fun.id, but is this efficient? Also, it will silently discard a non-terminated line at the end of input.
  • I didn’t find any function that allows me to create an In_channel.t from a string. Thus, if I already have data in memory that I want to parse, would I need to write it to a temporary file first? That doesn’t seem nice.
  • Parsing only from a string would require loading a big file into memory first. Also not the best idea.

Specifically with regard to supporting parsing both from a file (or maybe a network socket later) and from memory (e.g. for test cases), do I really have to provide multiple interfaces for that? And what is the way to deal with non-terminated lines at the end of a file?

Some source code I experimented with to demonstrate the available options and their issues:

type result = Result (* Some parsing result. *)

let parser_using_scanf : Scanf.Scanning.in_channel -> result =
 fun input ->
  try
    while true do
      let line = Scanf.bscanf input "%[^\n]\n" Fun.id in
      Printf.printf "Got: %s\n%!" line
    done
  with End_of_file ->
    Result

let parser_using_input_channel : In_channel.t -> result =
 fun input ->
  try
    while true do
      let line =
        match In_channel.input_line input with
        | None -> raise End_of_file
        | Some x -> x
      in
      Printf.printf "Got: %s\n%!" line
    done
  with End_of_file ->
    Result

let parser_using_string : string -> result =
 fun input -> parser_using_scanf (Scanf.Scanning.from_string input)

let _ =
  (* This works fine: *)
  ignore @@ parser_using_scanf (Scanf.Scanning.from_string "ABC\nDEF\n");

  (* This ignores "DEF": *)
  ignore @@ parser_using_scanf (Scanf.Scanning.from_string "ABC\nDEF");

  (* Here, I don't know how to read from a string in memory,
     but at least an incomplete line is not ignored: *)
  (*   ignore @@ parser_using_input_channel In_channel.stdin; *)

  (* This works fine, but reading from a file would force me to load
     everything into memory first: *)
  ignore @@ parser_using_string "ABC\nDEF\n";

  (* This ignores "DEF" again (due to its current implementation): *)
  ignore @@ parser_using_string "ABC\nDEF"

Thanks for your opinions on this issue.

Well, I did find a way to feed a string into an In_channel.t, but it’s platform-dependent and not really nice. Sharing it for curiosity’s sake:

#require "unix"

type result = Result (* Some parsing result. *)

let parser_using_input_channel : In_channel.t -> result =
 fun input ->
  try
    while true do
      let line =
        match In_channel.input_line input with
        | None -> raise End_of_file
        | Some x -> x
      in
      Printf.printf "Got: %s\n%!" line
    done
  with End_of_file -> Result

let parser_using_string : string -> result =
 fun str ->
  let input_fd, output_fd = Unix.pipe ~cloexec:true () in
  let input = Unix.in_channel_of_descr input_fd in
  let output = Unix.out_channel_of_descr output_fd in
  let output_task =
    Domain.spawn begin fun () ->
        Fun.protect ~finally:(fun () -> Out_channel.close_noerr output)
          begin fun () ->
            Out_channel.output_string output str;
            Out_channel.flush output
          end
      end
  in
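  (* Close the read end of the pipe once parsing is done,
     so the file descriptor is not leaked. *)
  let result =
    Fun.protect ~finally:(fun () -> In_channel.close_noerr input)
      (fun () -> parser_using_input_channel input)
  in
  Domain.join output_task;
  result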

let _ = ignore @@ parser_using_string "ABC\nDEF"

Unlike the solutions using Scanf, this at least doesn’t silently discard the last non-terminated line.

There’s also Lexing.lexbuf (the input type for ocamllex).

But regarding your question: the input type is often tied to the implementation of the parser, so my first reaction would be to define your own abstract type input with suitable constructors (e.g. val input_of_channel: in_channel -> input) instead of trying to reuse an existing stdlib type. Then you are free to customize buffering and other aspects as you see fit.

See the “modular IO” proposal by c-cube (Pull Request #19 in ocaml/RFCs on GitHub) for a proposal in this direction.

Cheers,
Nicolas

I can’t claim to have used it, but I did bookmark the bytesrw package (0.3.0, latest) because it looked very useful for this sort of thing. From its description:

Readers can be created from bytes, strings, slices, input channels, file descriptors, etc. More generally any function that enumerates a stream’s slices can be turned into a byte stream reader.

Defining my own abstract input type makes sense and seems to be the most flexible option.

Putting it together with a solution for the non-terminated lines, I wrote up this:

type result = Result
(* Some parsing result. *)

type in_channel = Scanf.Scanning.in_channel
(* Input channel, will be abstract in interface. *)

let parse : in_channel -> result =
 fun input ->
  try
    while true do
      let line = Scanf.bscanf input "%[^\n]" Fun.id in
      (try Scanf.bscanf input "\n" ()
       with End_of_file -> if line = "" then raise End_of_file);
      Printf.printf "Got: %s\n%!" line
    done
  with End_of_file -> Result

let channel_of_scanf_channel : Scanf.Scanning.in_channel -> in_channel =
  Fun.id

let channel_of_in_channel : In_channel.t -> in_channel =
  Scanf.Scanning.from_channel

let channel_of_string : string -> in_channel =
  Scanf.Scanning.from_string

let _ =
  (* This works fine: *)
  ignore @@ parse (channel_of_string "ABC\nDEF\n");

  (* And this works fine too now: *)
  ignore @@ parse (channel_of_string "ABC\nDEF")

Does this look fine? Any other recommendations or warnings? Or tricks to avoid the extra call try Scanf.bscanf input "\n" ()? I didn’t find any way to denote an optional line break in the format string.

I’ll keep that in mind, thanks, though I would prefer using stdlib for this, if possible.

Or using a more line-based approach internally:

type result = Result
(* Some parsing result. *)

type in_channel = unit -> string option
(* Input channel, will be abstract in interface. *)

let parse : in_channel -> result =
 fun input ->
  let next_line () =
    match input () with None -> raise End_of_file | Some line -> line
  in
  try
    while true do
      let line = next_line () in
      Printf.printf "Got: %s\n%!" line
    done
  with End_of_file -> Result

let from_scanf_channel : Scanf.Scanning.in_channel -> in_channel =
 fun input ->
  fun () ->
   try
     let line = Scanf.bscanf input "%[^\n]" Fun.id in
     (try Scanf.bscanf input "\n" ()
      with End_of_file -> if line = "" then raise End_of_file);
     Some line
   with End_of_file -> None

let channel_of_in_channel : In_channel.t -> in_channel =
 fun input -> fun () -> In_channel.input_line input

let channel_of_string : string -> in_channel =
 fun s -> from_scanf_channel (Scanf.Scanning.from_string s)

(* Alternative implementation: *)
let channel_of_string' : string -> in_channel =
 fun s ->
  let lines = ref (String.split_on_char '\n' s) in
  fun () ->
    match !lines with
    | [] | [ "" ] -> None
    | x :: xs ->
        lines := xs;
        Some x

let channel_of_line_seq : string Seq.t -> in_channel = Seq.to_dispenser

let _ =
  (* This works fine: *)
  ignore @@ parse (channel_of_string' "ABC\nDEF\n");

  (* And this works fine too now: *)
  ignore @@ parse (channel_of_string' "ABC\nDEF")

Using the alternative channel_of_string' implementation would allow me to get rid of Scanf completely.
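
For illustration, here is a minimal sketch of how the Scanf-free variant could be sealed behind a signature so that the dispenser representation stays hidden. The module name Line_input and its function names are assumptions made up for this example, and parse merely collects the lines instead of doing real parsing.

```ocaml
(* Hypothetical module; all names are illustrative only. *)
module Line_input : sig
  type t
  (* Abstract: callers cannot see that this is a line dispenser. *)

  val of_string : string -> t
  val of_channel : In_channel.t -> t
  val parse : t -> string list
end = struct
  type t = unit -> string option

  (* Same approach as channel_of_string' above: split eagerly,
     then hand out one line per call. A trailing empty chunk
     (input ending in '\n', or empty input) yields no line. *)
  let of_string s =
    let lines = ref (String.split_on_char '\n' s) in
    fun () ->
      match !lines with
      | [] | [ "" ] -> None
      | x :: xs ->
          lines := xs;
          Some x

  let of_channel ic = fun () -> In_channel.input_line ic

  (* Placeholder for real parsing: returns the lines in order. *)
  let parse input =
    let rec loop acc =
      match input () with
      | None -> List.rev acc
      | Some line -> loop (line :: acc)
    in
    loop []
end
```

With this, Line_input.parse (Line_input.of_string "ABC\nDEF") and the same call on "ABC\nDEF\n" both see the lines "ABC" and "DEF", and the representation can later be swapped without touching callers.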

I have been thinking about this more. One issue is that I lose interoperability, of course. If I want to parse not just my own data types from that stream but also other data (using third-party libraries or the stdlib), this isn’t trivially possible. For example, if I turn strings in memory into my in_channel, then I would also need to provide a function to turn the unparsed part of this in_channel back into a string, I guess. Or I would need to provide some sort of interface that allows other parsers (byte stream consumers?) to access my in_channel.
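
A rough sketch of what such a hand-off could look like for the line-dispenser representation, using Seq.of_dispenser from the stdlib. The helpers remaining_lines and remaining_string are hypothetical names introduced here, not part of any existing API, and the sketch assumes the other consumer is happy with lines or a rebuilt string.

```ocaml
(* The dispenser type from above: each call yields the next line. *)
type in_channel = unit -> string option

(* Same idea as channel_of_string' above. *)
let channel_of_string (s : string) : in_channel =
  let lines = ref (String.split_on_char '\n' s) in
  fun () ->
    match !lines with
    | [] | [ "" ] -> None
    | x :: xs ->
        lines := xs;
        Some x

(* Hypothetical helper: expose the lines not yet consumed as a
   (one-shot) sequence that another consumer can iterate over. *)
let remaining_lines (input : in_channel) : string Seq.t =
  Seq.of_dispenser input

(* Hypothetical helper: rebuild a string from the unconsumed lines.
   Whether the original input ended with a newline is not recoverable. *)
let remaining_string (input : in_channel) : string =
  String.concat "\n" (List.of_seq (remaining_lines input))
```

For example, after reading one line from channel_of_string "ABC\nDEF\nGHI\n", remaining_string returns "DEF\nGHI". The lost trailing newline shows the limitation: a line-based representation cannot faithfully give back the raw bytes.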

In theory, I would rather see structurally typed signatures as an “interface” for such I/O operations and then do my parsing within a functor that takes such an interface as an argument (though that might be unwieldy in practice). Alternatively, we need a common concrete interface for I/O that is interoperable. Maybe Scanf.Scanning.in_channel is such an interface provided by the stdlib (when it comes to parsing), but I’m not sure what to think. A third-party library could provide one as well, but I mostly rule that out for my own use case.
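
To make the functor idea concrete, here is a minimal sketch under the assumption that the parser only needs a way to fetch the next line. The signature LINE_SOURCE, the functor Make_parser and the module String_source are names invented for this example, and parse just collects lines as a stand-in for real parsing.

```ocaml
(* Hypothetical signature: the minimal capability the parser needs. *)
module type LINE_SOURCE = sig
  type t
  val input_line : t -> string option
end

module Make_parser (S : LINE_SOURCE) = struct
  (* Stand-in for real parsing: returns all lines in order. *)
  let parse (src : S.t) : string list =
    let rec loop acc =
      match S.input_line src with
      | None -> List.rev acc
      | Some line -> loop (line :: acc)
    in
    loop []
end

(* The stdlib In_channel module already matches the signature. *)
module Channel_parser = Make_parser (In_channel)

(* An in-memory source that reads lines without copying the whole
   input up front; a final unterminated line is still returned. *)
module String_source = struct
  type t = { data : string; mutable pos : int }

  let of_string data = { data; pos = 0 }

  let input_line src =
    let len = String.length src.data in
    if src.pos >= len then None
    else
      let stop =
        match String.index_from_opt src.data src.pos '\n' with
        | Some i -> i
        | None -> len
      in
      let line = String.sub src.data src.pos (stop - src.pos) in
      src.pos <- stop + 1;
      Some line
end

module String_parser = Make_parser (String_source)
```

Here String_parser.parse (String_source.of_string "ABC\nDEF") sees both lines, and Channel_parser.parse works directly on an In_channel.t. The cost is that every caller has to pick or build a functor instance, which is the practical awkwardness mentioned above.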