Semantics of in_channel, out_channel, when two processes operate on the same file

Suppose I have a file that I initialise with n bytes of data (e.g. “hello world!”). I set up a process which opens the file via an in_channel and repeatedly reads n bytes from the beginning of the file. Whenever the bytes change, the process should print out the new bytes.

I now start a second process, which opens the file via an out_channel. It overwrites the initial bytes with some new bytes (e.g. “world hello!”), then closes the out_channel and terminates.

Naively I expect the first process to print out the new bytes. However, it doesn’t, presumably because the in_channel is buffered, so the first process just repeatedly reads the buffered old data without realizing that the data in the file has changed.

Is this intended behaviour? If so, is there a way to force the in_channel to read from the underlying file descriptor? Alternatively, should I be using standard Unix IO instead?

Here is some code that exhibits the behaviour: http://ix.io/3HcX
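
For readers without access to the link, here is a minimal single-process sketch of the scenario described above. It is not the code behind the link: the second process is simulated by a separate out_channel on the same file, and the names `path`, `overwrite` and `read_prefix` are made up for illustration. On a typical Unix build I would expect `after` to still hold the old bytes, because `seek_in` to a position inside the buffer does not refill it, but this depends on the runtime's buffering and is not guaranteed.

```ocaml
(* Hypothetical reproduction: one process, two channels on the same file. *)
let path = Filename.temp_file "stale_demo" ".txt"

(* Overwrite the first bytes in place, without truncating the file.
   This stands in for the second process. *)
let overwrite s =
  let oc = open_out_gen [Open_wronly; Open_binary] 0o600 path in
  output_string oc s;
  close_out oc

(* Seek to the start and read n bytes through the (buffered) in_channel. *)
let read_prefix ic n =
  seek_in ic 0;
  really_input_string ic n

let before, after =
  overwrite "hello world!";
  let ic = open_in_bin path in
  let before = read_prefix ic 12 in
  overwrite "world hello!";
  (* This second read may be served from the channel's buffer. *)
  let after = read_prefix ic 12 in
  close_in ic;
  Sys.remove path;
  (before, after)

let () = Printf.printf "before=%S after=%S\n" before after
```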

I tend to think this is expected given the buffering behaviour of in_channel. If you don’t want buffering, Unix IO can indeed be used. There is currently no way to control the buffering behaviour of in_channel.
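
A minimal sketch of the Unix route, assuming the program is linked against the unix library. The helper names `read_prefix_unix` and `watch` are hypothetical, and the polling interval and bounded poll count are arbitrary choices for illustration. Each poll seeks back to offset 0 and calls `Unix.read` directly, so there is no user-space buffer that can go stale.

```ocaml
(* Requires the unix library. *)

(* Read up to n bytes from the start of the file behind fd.
   Loops because Unix.read may return fewer bytes than requested. *)
let read_prefix_unix fd n =
  let buf = Bytes.create n in
  ignore (Unix.lseek fd 0 Unix.SEEK_SET);
  let rec fill off =
    if off < n then
      match Unix.read fd buf off (n - off) with
      | 0 -> off                      (* end of file reached early *)
      | k -> fill (off + k)
    else off
  in
  let got = fill 0 in
  Bytes.sub_string buf 0 got

(* Poll the file a bounded number of times, printing the prefix
   whenever it differs from the previously seen value. *)
let watch ?(interval = 0.1) ~polls path n =
  let fd = Unix.openfile path [Unix.O_RDONLY] 0 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    let last = ref "" in
    for _ = 1 to polls do
      let s = read_prefix_unix fd n in
      if s <> !last then begin
        print_endline s;
        last := s
      end;
      Unix.sleepf interval
    done)
```

Something like `watch ~polls:100 "data.txt" 12` would then print the first 12 bytes each time they change during roughly ten seconds of polling.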

Cheers,
Nicolas

Ok, but I still think this is a bit strange!

Do you have any alternative semantics in mind?

Cheers,
Nicolas

I think unbuffered IO would be a reasonable default, since you can build buffered IO on top, but not vice-versa. I also think that the behavior of in_channel is a bit strange, particularly since there isn’t a way to reset the buffer.

I’m not sure I agree here. Unbuffered IO is an easy way to kill the performance of your program, and buffered is the right default in most use-cases. If you need unbuffered IO you can use Unix, so all bases seem to be covered (albeit with a slightly different API), no?

Having an option to turn off buffering may be something to consider though (a way to turn off buffering for output channels was recently added: https://github.com/ocaml/ocaml/pull/10538).
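
For the output side, a small sketch of what that looks like, assuming OCaml 4.14 or later, where `Out_channel.set_buffered` is available. The function name `write_unbuffered` is hypothetical. It writes without an explicit flush and then reads the file back through a separate channel to check that the bytes have already reached the file.

```ocaml
(* Assumes OCaml >= 4.14 (Out_channel and In_channel modules). *)
let write_unbuffered path s =
  let oc = open_out_bin path in
  (* Switch the channel to unbuffered mode: each write is flushed at once. *)
  Out_channel.set_buffered oc false;
  output_string oc s;
  (* No explicit flush: read the file back through an independent channel. *)
  let seen = In_channel.with_open_bin path In_channel.input_all in
  close_out oc;
  seen
```

Note that this only affects writers; it does nothing for a reader whose in_channel already holds old data in its buffer.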

One way is simply to close and re-open the in_channel. Having a way to turn off buffering in the in_channel, as mentioned above, could be another.
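
A minimal sketch of the close-and-re-open approach, using only the standard library. The name `read_fresh` is hypothetical. Every call opens a new in_channel, so its buffer starts empty and is filled from the file as it is at that moment.

```ocaml
(* Open a fresh in_channel for every read, so no old buffer is reused. *)
let read_fresh path n =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> really_input_string ic n)
```

This costs an open and a close per poll, which is probably acceptable for occasional polling but not for a tight loop.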

Cheers,
Nicolas

I don’t mind the buffering, but a general rule of buffering is that if the underlying data changes, the buffer should detect this and reset itself accordingly. I realise this is difficult to do in this situation (because there is no reliable way to detect that the underlying data has changed short of re-reading the data - an unfortunate failure of filesystem design in my opinion), but still, that is why the semantics is confusing: “When using an in_channel, attempting to read from a given position may result in arbitrarily old data being returned (due to the existence of a buffer containing stale data).” - if this is really the semantics we should be upfront about it.

Also, at the risk of pushing my luck and annoying even more people… Can I repeat my oft-made request to have pread and pwrite included in the Unix library? Given that multicore is just round the corner, pread and pwrite have substantial advantages over seek+(read/write), in that the user doesn’t have to bother to implement their own locking in order to get correct code. This potentially makes using pread and pwrite more efficient than using seek+(read/write), as well as simpler.
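
To make the locking point concrete, here is a rough sketch of what a positioned read has to look like today with seek+read when a descriptor is shared between threads or domains. It assumes the unix library, and a `Mutex` module (part of the standard library on OCaml 5, part of the threads library on 4.x). The names `shared_fd` and `pread_emulated` are made up for illustration; this is an emulation, not an existing Unix function.

```ocaml
(* Requires the unix library (and threads on OCaml 4.x for Mutex). *)

(* A descriptor paired with the lock that guards its file offset. *)
type shared_fd = { fd : Unix.file_descr; lock : Mutex.t }

(* Emulated positioned read: the seek and the read must happen under the
   same lock, otherwise another thread could move the offset in between. *)
let pread_emulated t ~offset buf pos len =
  Mutex.lock t.lock;
  Fun.protect
    ~finally:(fun () -> Mutex.unlock t.lock)
    (fun () ->
      ignore (Unix.lseek t.fd offset Unix.SEEK_SET);
      Unix.read t.fd buf pos len)
```

A real pread takes the offset as an argument and leaves the descriptor's offset alone, so neither the lock nor the extra system call would be needed.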

Probably worth opening an issue for this on the ocaml/ocaml issue tracker on GitHub. And PRs are always welcome… :slight_smile:

Cheers,
Nicolas

A note to this effect in the documentation seems like a good idea. You can open an issue on the ocaml/ocaml issue tracker on GitHub or, better yet, propose a PR in this direction directly.

Cheers,
Nicolas

It is interesting, but is it really a good idea? Off the top of my head I cannot recall coming across languages which provide a buffered file abstraction for input (whether called channels, streams, ports or whatever) which automatically refills buffers every time a different process writes to the file which is being read. If you are doing that, aren’t you likely to be working in the weeds, using POSIX read() and write() (or pread() and pwrite()) to begin with, supplemented if necessary with locking?

Edit: I have seen arrangements (I have implemented one myself in a different language) whereby setting the file pointer of a seekable file to a position earlier than its current position discards the buffers, so that the next read refills them. I guess that would partly do what the OP wants for their particular use case. A comment on Stdlib.seek_in saying that this does not necessarily happen with in_channel could perhaps be useful. It wouldn’t help with locking, though, if there is contention with another process.

Maybe “default” is the wrong word. I agree that most programs should use buffered IO. But Unix isn’t portable and channels are (I think?). It would be better to have a portable API that can do both buffered and unbuffered IO.

In what sense do you mean? Unix is POSIX-ish and does a pretty admirable job on both Linux and Windows.

Cheers,
Nicolas

I was just going by the “Portability – OCaml” page, which says that Unix support is partial on Windows unless you run under Cygwin. Although it does seem that the IO functions are portable.

I guess I am confused about what a channel (say, an input channel) is supposed to be. Naively I imagine a potentially infinite sequence of bytes. But then there are functions like pos_in, seek_in etc. which look a bit like operations on a file descriptor. Then there is the fact that the sequence is backed by a file, which opens up the possibility of concurrent modification, whose semantics is what started this discussion. I suppose we can view the “pos” and “seek” functions as a bit infra dig, and then say that the implementation on top of a file is just a detail, and that for channels to work well we should never write over existing data in the file etc. With these restrictions I can agree that the idea of a channel is a bit clearer.