Just noting publicly something that I stumbled upon while writing a small UTF-8 decoder: this seems to be a case where the strange ungetc() function from the C standard library would prove useful. man 3 ungetc reads (emphasis mine):
ungetc() pushes c back to stream, cast to unsigned char, where it is available for subsequent read operations. Pushed-back characters will be returned in reverse order; only one pushback is guaranteed.
In a UTF-8 decoder we need it to handle invalid UTF-8 sequences. The sequence is read byte by byte; if the new byte makes the current sequence invalid, we should do something with the invalid sequence, then start decoding again from the new byte. So conceptually, we push the new byte back into the input stream (unget_char); or, equivalently, we read it from the stream but don’t consume it (peek_char) until we are sure we can proceed with it. We never need to peek at or push back more than one byte.
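To make the "restart from the offending byte" idea concrete, here is a minimal sketch of such a decoder. It is not my actual code: the source type and the names peek, advance, classify and decode_one are made up for illustration, the input is a plain string rather than an in_channel, and I assume an invalid sequence is replaced by U+FFFD. The allowed ranges for the first continuation byte follow the well-formed byte sequences table of the Unicode standard.

```ocaml
(* A byte source with one byte of lookahead, backed by a string.
   [source], [peek] and [advance] are hypothetical names. *)
type source = { s : string; mutable pos : int }

let peek src =
  if src.pos < String.length src.s then Some (Char.code src.s.[src.pos])
  else None

let advance src = src.pos <- src.pos + 1

let replacement = 0xFFFD

(* For a lead byte: number of continuation bytes, payload bits of the lead
   byte, and the allowed range of the FIRST continuation byte (this is what
   rejects overlong encodings, surrogates and values above U+10FFFF). *)
let classify b0 =
  if b0 < 0x80 then Some (0, b0, 0x80, 0xBF)
  else if b0 >= 0xC2 && b0 <= 0xDF then Some (1, b0 land 0x1F, 0x80, 0xBF)
  else if b0 >= 0xE0 && b0 <= 0xEF then
    Some (2, b0 land 0x0F,
          (if b0 = 0xE0 then 0xA0 else 0x80),
          (if b0 = 0xED then 0x9F else 0xBF))
  else if b0 >= 0xF0 && b0 <= 0xF4 then
    Some (3, b0 land 0x07,
          (if b0 = 0xF0 then 0x90 else 0x80),
          (if b0 = 0xF4 then 0x8F else 0xBF))
  else None

(* Decode one scalar value; None at end of input. *)
let decode_one src =
  match peek src with
  | None -> None
  | Some b0 ->
    advance src; (* the lead byte is always consumed *)
    (match classify b0 with
     | None -> Some replacement
     | Some (n, bits, lo, hi) ->
       let rec loop n acc lo hi =
         if n = 0 then Some acc
         else
           match peek src with
           | Some b when b >= lo && b <= hi ->
             advance src;
             loop (n - 1) ((acc lsl 6) lor (b land 0x3F)) 0x80 0xBF
           | _ ->
             (* The offending byte is only peeked, never consumed:
                decoding restarts from it on the next call. *)
             Some replacement
       in
       loop n bits lo hi)

let decode_all s =
  let src = { s; pos = 0 } in
  let rec go acc =
    match decode_one src with
    | None -> List.rev acc
    | Some u -> go (u :: acc)
  in
  go []
```

Note that the decoder only ever calls peek once before deciding whether to advance, which is the "never more than one byte" property. For instance, on a truncated three-byte sequence followed by an ASCII letter, the letter is not swallowed by the error.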
OCaml’s in_channel does not have these operations. Of course one can use an additional buffer to store the maybe-read last byte, but then it is a bit annoying because (1) it requires implementation effort from the user of in_channel, (2) it adds an abstraction on top of in_channel (if done in a reusable way), and (3) it adds a bit of boxing. I looked at the C implementation of in_channel a while ago: it seems to me that, considering how the input buffer is handled, it is never empty, so peek_char could be implemented there today with near-zero implementation effort.
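For reference, the user-side workaround looks roughly like the sketch below. The peekable type and the functions of_channel, peek_char and read_char are names I am making up here, not anything from the standard library; the only real Stdlib call used on the channel is input_char. The char option field is where the extra boxing comes from.

```ocaml
(* Hypothetical wrapper: an in_channel plus one byte of lookahead. *)
type peekable = {
  ic : in_channel;
  mutable pending : char option; (* the maybe-read byte, boxed in an option *)
}

let of_channel ic = { ic; pending = None }

(* Return the next byte without consuming it; None at end of input. *)
let peek_char p =
  match p.pending with
  | Some _ as c -> c
  | None ->
    (match input_char p.ic with
     | c ->
       p.pending <- Some c;
       p.pending
     | exception End_of_file -> None)

(* Consume and return the next byte; None at end of input. *)
let read_char p =
  let c = peek_char p in
  p.pending <- None;
  c
```

Every reader function that should cooperate with the lookahead (line input, block input, seeking) would have to be re-exposed through the wrapper as well, which is point (2) above.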
unget_char is a peculiar interface indeed (what if I’m at the start of the file? what if I try to push a byte different from the last byte read?), but peek_char feels relatively natural, and should be easy to implement (if the buffer is empty, input the next chunk but do not advance the offset), only with the perhaps-surprising constraint that we cannot peek more than one byte.
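To illustrate what I mean by "input the next chunk but do not advance the offset", here is a toy model of a buffered reader. To be clear, this is not the runtime's actual data structure or code: the reader record, its fields, and the functions ensure, next_char and of_string are all invented for this sketch, and the chunk source is a string only so that the example is self-contained.

```ocaml
(* Toy model of a buffered reader; NOT the actual runtime structure. *)
type reader = {
  refill : bytes -> int; (* fills the buffer, returns bytes read; 0 at EOF *)
  buf : bytes;
  mutable off : int;     (* position of the next unread byte *)
  mutable len : int;     (* number of valid bytes in [buf] *)
}

(* Make sure at least one unread byte is buffered; false at end of input. *)
let ensure r =
  if r.off >= r.len then begin
    r.len <- r.refill r.buf;
    r.off <- 0
  end;
  r.off < r.len

(* Peeking refills if needed but leaves [off] untouched. *)
let peek_char r =
  if ensure r then Some (Bytes.get r.buf r.off) else None

(* Reading is the same, plus advancing the offset. *)
let next_char r =
  if ensure r then begin
    let c = Bytes.get r.buf r.off in
    r.off <- r.off + 1;
    Some c
  end
  else None

(* A string-backed chunk source, with a tiny buffer to exercise refills. *)
let of_string ?(chunk = 2) s =
  let pos = ref 0 in
  let refill buf =
    let n = min (Bytes.length buf) (String.length s - !pos) in
    Bytes.blit_string s !pos buf 0 n;
    pos := !pos + n;
    n
  in
  { refill; buf = Bytes.create chunk; off = 0; len = 0 }
```

In this model peek_char is just next_char without the offset increment. Peeking two bytes ahead is where it stops being free: the second byte may live in the next chunk, and refilling would overwrite the first one.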
My 2 cents. I guess this has been discussed before.
PS: My code is also an example where we would like to be able to produce an in_channel wrapper that would read from a string, which is not possible with OCaml.
PPS: I guess some folks here might have an opinion on my handling of UTF-8.