I have an API to which I would like to add a function that parses a specific data structure from its text representation.
I currently process the input on a per-line basis, but that is an implementation detail that might change in the future. So ideally, I wouldn’t want to reflect it in the interface.
What data type should I use for the “input channel”? I see the following options (using stdlib) so far: Scanf.Scanning.in_channel, In_channel.t, and string.
All variants above seem to come with disadvantages:
Reading on a per-line basis seems to be somewhat troublesome with Scanf.Scanning.in_channel. I can use Scanf.bscanf input "%[^\n]\n" Fun.id, but is this efficient? Also, it will silently discard a non-terminated line at the end of input.
I didn’t find any function that allows me to create an In_channel.t from a string. Thus, if I already have data in memory that I want to parse, would I need to write it to a temporary file first? That doesn’t seem nice.
Parsing only from a string would require loading a big file into memory first. Also not the best idea.
Specifically with regard to supporting parsing both from a file (or maybe a network socket later) and from memory (e.g. for test cases), do I really have to provide multiple interfaces for that? And what is the way to deal with non-terminated lines at the end of a file?
Some source code I experimented with to demonstrate the available options and their issues:
type result = Result (* Some parsing result. *)

let parser_using_scanf : Scanf.Scanning.in_channel -> result =
  fun input ->
    try
      while true do
        let line = Scanf.bscanf input "%[^\n]\n" Fun.id in
        Printf.printf "Got: %s\n%!" line
      done
    with End_of_file ->
      Result

let parser_using_input_channel : In_channel.t -> result =
  fun input ->
    try
      while true do
        let line =
          match In_channel.input_line input with
          | None -> raise End_of_file
          | Some x -> x
        in
        Printf.printf "Got: %s\n%!" line
      done
    with End_of_file ->
      Result

let parser_using_string : string -> result =
  fun input -> parser_using_scanf (Scanf.Scanning.from_string input)

let _ =
  (* This works fine: *)
  ignore @@ parser_using_scanf (Scanf.Scanning.from_string "ABC\nDEF\n");
  (* This ignores "DEF": *)
  ignore @@ parser_using_scanf (Scanf.Scanning.from_string "ABC\nDEF");
  (* Here, I don't know how to read from a string in memory,
     but at least an incomplete line is not ignored: *)
  (* ignore @@ parser_using_input_channel In_channel.stdin; *)
  (* This works fine, but reading from a file would force me to load
     everything into memory first: *)
  ignore @@ parser_using_string "ABC\nDEF\n";
  (* This ignores "DEF" again (due to its current implementation): *)
  ignore @@ parser_using_string "ABC\nDEF"
Well, I did find a solution for that part, but it’s platform-dependent and not really nice. Sharing for curiosity though:
#require "unix"

type result = Result (* Some parsing result. *)

let parser_using_input_channel : In_channel.t -> result =
  fun input ->
    try
      while true do
        let line =
          match In_channel.input_line input with
          | None -> raise End_of_file
          | Some x -> x
        in
        Printf.printf "Got: %s\n%!" line
      done
    with End_of_file -> Result

let parser_using_string : string -> result =
  fun str ->
    let input_fd, output_fd = Unix.pipe ~cloexec:true () in
    let input = Unix.in_channel_of_descr input_fd in
    let output = Unix.out_channel_of_descr output_fd in
    let output_task =
      Domain.spawn begin fun () ->
        Fun.protect ~finally:(fun () -> Out_channel.close_noerr output)
          begin fun () ->
            Out_channel.output_string output str;
            Out_channel.flush output
          end
      end
    in
    let result = parser_using_input_channel input in
    Domain.join output_task;
    result

let _ = ignore @@ parser_using_string "ABC\nDEF"
As opposed to the solutions using Scanf, this at least doesn’t silently discard a non-terminated line at the end.
There’s also Lexing.lexbuf (the input type for ocamllex).
But regarding your question: the input type is often related to the implementation of the parser, so my first reaction would be to define your own abstract type input with suitable constructors (e.g., val input_of_channel: in_channel -> input) instead of trying to reuse an existing stdlib type. Then you are free to customize buffering and other aspects as you see fit.
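A minimal sketch of that suggestion, assuming a line-oriented parser. The module name Line_parser, the constructor input_of_string, and the stand-in parse function (which just returns the lines it reads) are hypothetical; only input_of_channel is named above. The signature keeps input abstract, so the representation can change later without touching callers.

```ocaml
(* Hypothetical module illustrating an abstract input type. *)
module Line_parser : sig
  type input (* abstract: callers cannot see how lines are produced *)

  val input_of_channel : In_channel.t -> input
  val input_of_string : string -> input

  (* Stand-in for a real parser: returns the lines it consumed. *)
  val parse : input -> string list
end = struct
  (* Current representation: a function yielding the next line, if any. *)
  type input = unit -> string option

  let input_of_channel ic = fun () -> In_channel.input_line ic

  let input_of_string s =
    let pos = ref 0 in
    fun () ->
      if !pos >= String.length s then None
      else begin
        (* A line ends at the next newline or at the end of the string. *)
        let stop =
          match String.index_from_opt s !pos '\n' with
          | Some i -> i
          | None -> String.length s
        in
        let line = String.sub s !pos (stop - !pos) in
        pos := stop + 1;
        Some line
      end

  let parse input =
    let rec loop acc =
      match input () with
      | None -> List.rev acc
      | Some line -> loop (line :: acc)
    in
    loop []
end
```

With this shape, a test can call Line_parser.parse (Line_parser.input_of_string "ABC\nDEF") while production code passes a channel, and a non-terminated last line is kept in both cases.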
Readers can be created from bytes, strings, slices, input channels, file descriptors, etc. More generally any function that enumerates a stream’s slices can be turned into a byte stream reader.
Putting it together with a solution for the non-terminated lines, I wrote up this:
type result = Result
(* Some parsing result. *)

type in_channel = Scanf.Scanning.in_channel
(* Input channel, will be abstract in interface. *)

let parse : in_channel -> result =
  fun input ->
    try
      while true do
        let line = Scanf.bscanf input "%[^\n]" Fun.id in
        (try Scanf.bscanf input "\n" ()
         with End_of_file -> if line = "" then raise End_of_file);
        Printf.printf "Got: %s\n%!" line
      done
    with End_of_file -> Result

let channel_of_scanf_channel : Scanf.Scanning.in_channel -> in_channel =
  Fun.id

let channel_of_in_channel : In_channel.t -> in_channel =
  Scanf.Scanning.from_channel

let channel_of_string : string -> in_channel =
  Scanf.Scanning.from_string

let _ =
  (* This works fine: *)
  ignore @@ parse (channel_of_string "ABC\nDEF\n");
  (* And this works fine too now: *)
  ignore @@ parse (channel_of_string "ABC\nDEF")
Looks fine this way? Or any other recommendations and/or warnings? Or tricks to avoid the extra call try Scanf.bscanf input "\n" ()? I didn’t find any way to denote an optional linebreak in the format string.
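One possible way to fold both steps into a single bscanf call is to read the newline with a second, width-limited range conversion. This is a sketch, not a confirmed idiom: it relies on the assumption that a %[ conversion returns an empty string instead of raising at end of input. The names read_line and lines_of_string are hypothetical.

```ocaml
(* Read one line, or None at end of input.
   The second conversion, %1[\n], consumes at most one newline and
   yields the empty string when no newline is present. *)
let read_line (input : Scanf.Scanning.in_channel) : string option =
  Scanf.bscanf input "%[^\n]%1[\n]" (fun line newline ->
      (* Nothing read at all means the input is exhausted. *)
      if line = "" && newline = "" then None else Some line)

(* Small helper for experiments: collect all lines of a string. *)
let lines_of_string (s : string) : string list =
  let input = Scanf.Scanning.from_string s in
  let rec loop acc =
    match read_line input with
    | None -> List.rev acc
    | Some line -> loop (line :: acc)
  in
  loop []
```

An empty line in the middle of the input still comes back as Some "", because the newline conversion is non-empty in that case.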
I’ll keep that in mind, thanks, though I would prefer using stdlib for this, if possible.
type result = Result
(* Some parsing result. *)

type in_channel = unit -> string option
(* Input channel, will be abstract in interface. *)

let parse : in_channel -> result =
  fun input ->
    let next_line () =
      match input () with None -> raise End_of_file | Some line -> line
    in
    try
      while true do
        let line = next_line () in
        Printf.printf "Got: %s\n%!" line
      done
    with End_of_file -> Result

let from_scanf_channel : Scanf.Scanning.in_channel -> in_channel =
  fun input ->
    fun () ->
      try
        let line = Scanf.bscanf input "%[^\n]" Fun.id in
        (try Scanf.bscanf input "\n" ()
         with End_of_file -> if line = "" then raise End_of_file);
        Some line
      with End_of_file -> None

let channel_of_in_channel : In_channel.t -> in_channel =
  fun input -> fun () -> In_channel.input_line input

let channel_of_string : string -> in_channel =
  fun s -> from_scanf_channel (Scanf.Scanning.from_string s)

(* Alternative implementation: *)
let channel_of_string' : string -> in_channel =
  fun s ->
    let lines = ref (String.split_on_char '\n' s) in
    fun () ->
      match !lines with
      | [] | [ "" ] -> None
      | x :: xs ->
        lines := xs;
        Some x

let channel_of_line_seq : string Seq.t -> in_channel = Seq.to_dispenser

let _ =
  (* This works fine: *)
  ignore @@ parse (channel_of_string' "ABC\nDEF\n");
  (* And this works fine too now: *)
  ignore @@ parse (channel_of_string' "ABC\nDEF")
Using the alternative channel_of_string' implementation, it would allow me to completely get rid of Scanf.
I have been thinking about this more. An issue is that I lose interoperability, of course. For example, if I want to parse not just my own data types from that stream but also other data (using third-party libraries or the stdlib), this isn’t trivially possible. If I turn strings in memory into my in_channel, then I would also need to provide a function to turn the unparsed part of this in_channel back into a string, I guess. Or I would need to provide some sort of interface that allows other parsers (byte stream consumers?) to access my in_channel.
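To make the “hand back the unparsed part” idea concrete, here is a rough sketch. The record type input and its fields next_line and rest are hypothetical, as are the two constructors; the record would be hidden behind an abstract type in a real interface.

```ocaml
(* Hypothetical input type: a line reader plus a way to hand back
   whatever has not been consumed yet. *)
type input = {
  next_line : unit -> string option;
  rest : unit -> string;
}

let input_of_string (s : string) : input =
  let pos = ref 0 in
  let len = String.length s in
  let next_line () =
    if !pos >= len then None
    else begin
      let stop =
        Option.value (String.index_from_opt s !pos '\n') ~default:len
      in
      let line = String.sub s !pos (stop - !pos) in
      (* Skip the newline, but never move past the end of the string. *)
      pos := min len (stop + 1);
      Some line
    end
  in
  let rest () =
    let remaining = String.sub s !pos (len - !pos) in
    pos := len;
    remaining
  in
  { next_line; rest }

let input_of_channel (ic : In_channel.t) : input =
  { next_line = (fun () -> In_channel.input_line ic);
    rest = (fun () -> In_channel.input_all ic) }
```

After reading one line from input_of_string "ABC\nDEF\nGHI", calling rest returns "DEF\nGHI", which could then be passed to another parser. For channels, this version reads the whole remainder into memory, so it only partly addresses the interoperability concern.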
In theory, I would rather see structurally typed signatures as an “interface” for such I/O operations and then do my parsing within a functor that takes such an interface as an argument (but that might be unwieldy in practice). Or we need a common concrete interface for I/O that is interoperable. Maybe Scanf.Scanning.in_channel is such an interface provided by the stdlib (when it comes to parsing), but I’m not sure what to think. Or a third-party library (which I mostly rule out for my own use case).
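For illustration, the functor variant could look roughly like this. The names LINE_INPUT, Make_parser, and String_input are hypothetical, and parse is a stand-in that returns the lines it reads. The sketch assumes OCaml 4.14 or later, where the stdlib In_channel module happens to match the signature structurally.

```ocaml
(* Hypothetical signature: the only capability the parser needs. *)
module type LINE_INPUT = sig
  type t
  val input_line : t -> string option
end

(* The parser is written once, against the signature. *)
module Make_parser (I : LINE_INPUT) = struct
  let parse (input : I.t) : string list =
    let rec loop acc =
      match I.input_line input with
      | None -> List.rev acc
      | Some line -> loop (line :: acc)
    in
    loop []
end

(* In_channel already provides type t and input_line with this shape. *)
module Channel_parser = Make_parser (In_channel)

(* In-memory input: the remaining lines, consumed one by one. *)
module String_input = struct
  type t = string list ref

  let of_string (s : string) : t = ref (String.split_on_char '\n' s)

  let input_line (t : t) : string option =
    match !t with
    | [] | [ "" ] -> None
    | x :: xs ->
      t := xs;
      Some x
end

module String_parser = Make_parser (String_input)
```

Usage would be String_parser.parse (String_input.of_string "ABC\nDEF") in tests and Channel_parser.parse on an opened file. The cost is that every caller has to pick or build an instantiation, which is the practical awkwardness mentioned above.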