A case for `In_channel.peek_char`

Just noting publicly something that I stumbled upon while writing a small UTF-8 decoder: this seems to be a case where the strange ungetc() function from the C standard library would prove useful. man 3 ungetc reads (emphasis mine):

ungetc() pushes c back to stream, cast to unsigned char, where it is available for subsequent read operations. Pushed-back characters will be returned in reverse order; only one pushback is guaranteed.

In a UTF-8 decoder we need it to handle invalid UTF-8 sequences. The sequence is read byte by byte; if the new byte makes the current sequence invalid, then we should do something with the invalid sequence, then start decoding again from the new byte. So conceptually, we push the new byte back into the input stream (unget_char); or, equivalently, we read it from the stream but don’t consume it (peek_char) until we are sure we want to proceed with it. We never need to peek / push back more than one byte.
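
To make the pattern concrete, here is a minimal sketch of such a decoder. It is not my actual code: the byte source is a hypothetical string cursor (`src`, `peek`, `advance` are names made up for this example), and the decoder is deliberately simplified, checking only the lead/continuation structure.

```ocaml
(* Hypothetical byte source with one byte of lookahead, backed by a string. *)
type src = { s : string; mutable pos : int }

let peek src =
  if src.pos < String.length src.s then Some src.s.[src.pos] else None

let advance src = src.pos <- src.pos + 1

type decoded = Scalar of int | Malformed | End

(* Simplified: does not reject overlong encodings, surrogates, or values
   above U+10FFFF. It only shows where the one-byte peek is needed. *)
let decode src =
  match peek src with
  | None -> End
  | Some c ->
    advance src;
    let b = Char.code c in
    (* Number of continuation bytes expected, and the bits of the lead byte. *)
    let need, init =
      if b < 0x80 then (0, b)
      else if b land 0xE0 = 0xC0 then (1, b land 0x1F)
      else if b land 0xF0 = 0xE0 then (2, b land 0x0F)
      else if b land 0xF8 = 0xF0 then (3, b land 0x07)
      else (-1, 0)
    in
    if need < 0 then Malformed
    else
      let rec loop need acc =
        if need = 0 then Scalar acc
        else
          match peek src with
          | Some c when Char.code c land 0xC0 = 0x80 ->
            advance src;
            loop (need - 1) ((acc lsl 6) lor (Char.code c land 0x3F))
          | _ ->
            (* The offending byte is only peeked, so it stays in the
               source and the next call starts decoding from it. *)
            Malformed
      in
      loop need init
```

On the input `"\xC3A"`, the first call reports `Malformed` for the truncated sequence and the second call decodes the `A` that was peeked but not consumed.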

OCaml’s in_channel does not have these operations. Of course one can use an additional buffer to store the maybe-read last byte, but then it is a bit annoying because (1) it requires implementation effort from the user of in_channel, (2) it adds an abstraction on top of in_channel (if done in a reusable way), and (3) it adds a bit of boxing. I looked at the C implementation of in_channel a while ago: it seems to me that, considering how the input buffer is handled, it is never empty, so peek_char could be implemented there today with near-zero implementation effort.
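
For reference, the user-side workaround could look something like this. It is only a sketch, assuming OCaml 4.14 or later for the `In_channel` module; the `peekable` type and its functions are names invented for the example.

```ocaml
(* Hypothetical wrapper: an in_channel plus a one-byte lookahead slot. *)
type peekable = {
  ic : in_channel;
  mutable pending : char option; (* the maybe-read byte; this is the boxing *)
}

let of_channel ic = { ic; pending = None }

(* Return the next byte without consuming it, or None at end of input. *)
let peek_char p =
  match p.pending with
  | Some _ as c -> c
  | None ->
    let c = In_channel.input_char p.ic in
    p.pending <- c;
    c

(* Consume and return the next byte, or None at end of input. *)
let input_char p =
  match p.pending with
  | Some _ as c ->
    p.pending <- None;
    c
  | None -> In_channel.input_char p.ic
```

Note that every other reading function one wants to use (`input`, `input_line`, and so on) would have to be wrapped in the same way so that it honours the pending byte, which is where the implementation effort goes.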

unget_char is a peculiar interface indeed (what if I’m at the start of the file? what if I try to push a byte different from the last byte read?), but peek_char feels relatively natural, and should be easy to implement (if the buffer is empty, then input the next chunk but do not advance the offset), only with the perhaps-surprising constraint that we cannot peek more than one byte.
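
The "refill but do not advance" idea can be modelled in a few lines of OCaml. This is a toy model only: the record fields and the `of_string` helper are made up for the example and do not mirror the actual struct in the C runtime.

```ocaml
(* Toy model of a buffered input channel (hypothetical field names). *)
type chan = {
  refill : bytes -> int; (* fills the buffer, returns bytes read; 0 = EOF *)
  buf : bytes;
  mutable pos : int;     (* offset of the next unread byte *)
  mutable len : int;     (* number of valid bytes in buf *)
}

(* If every buffered byte has been consumed, input the next chunk. *)
let ensure ch =
  if ch.pos >= ch.len then begin
    ch.len <- ch.refill ch.buf;
    ch.pos <- 0
  end

(* Look at the next byte; the offset is not advanced. *)
let peek_char ch =
  ensure ch;
  if ch.pos < ch.len then Some (Bytes.get ch.buf ch.pos) else None

(* Same as peek_char, then advance the offset by one. *)
let input_char ch =
  match peek_char ch with
  | Some _ as c ->
    ch.pos <- ch.pos + 1;
    c
  | None -> None

(* Example source: serve a string in chunks of at most [chunk] bytes. *)
let of_string ?(chunk = 4) s =
  let off = ref 0 in
  let refill buf =
    let n = min (min chunk (Bytes.length buf)) (String.length s - !off) in
    Bytes.blit_string s !off buf 0 n;
    off := !off + n;
    n
  in
  { refill; buf = Bytes.create chunk; pos = 0; len = 0 }
```

In this model peeking never needs a second buffer, but it also cannot look two bytes ahead in general: the second byte may belong to a chunk that has not been read yet, and refilling would overwrite the first.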

My 2 cents. I guess this has been discussed before.

PS: My code is also an example where we would like to be able to produce an in_channel wrapper that reads from a string, which is not possible with OCaml.

PPS: I guess some folks here might have an opinion on my handling of UTF-8. :wink:

I’m going to be tooting my own horn, as often, but I think the stdlib
channels are one of its weakest parts (among the ones that are commonly
used).

I experimented a bit with an extensible OCaml version here
(c-cube/poc-modular-io on GitHub, a proof of concept for
https://github.com/ocaml/RFCs/pull/19), following this RFC; I
think in the case you’re describing ungetc is not great, but having
access to the buffer is good.

C’s IO library is bad and shouldn’t be used as an example of what to do,
generally speaking. I think Go has good ones, but also see Rust’s
distinction between Read (like In.t in my POC above) and BufRead (like
In_buffered.t), which is what you’d use for parsing/lexing.
