Lwt_unix.read from a large file and scheduling


From the Lwt documentation I got the impression that the library always tries to do as much work as possible. Essentially, scheduling only happens when things block. Since on Linux reading from a file never blocks, I assumed that when processing a large file I would need to add explicit Lwt.pause calls between reads to ensure that other promises make progress. However, a test revealed that this is not the case. For example, consider the following code:

let test_file_read path = Lwt.(
  let read_fd fd =
    let buffer_size = 4096 in
    let buffer = Bytes.create buffer_size in
    (* Read chunk by chunk until end of file, summing the byte counts. *)
    let rec read_chunks sum =
      Lwt_unix.read fd buffer 0 buffer_size >>= fun nread ->
      if nread = 0 then return sum
      else read_chunks (sum +. float_of_int nread)
    in
    read_chunks 0.0 >>= fun total ->
    Printf.sprintf "Total read: %.3f MB" (total /. (1024.0 *. 1024.0)) |> Lwt_io.printl
  in
  Lwt_unix.(openfile path [ O_RDONLY; O_NONBLOCK; O_CLOEXEC ] 0) >>= fun fd ->
  (* Close the descriptor whether or not reading succeeds. *)
  finalize (fun () -> read_fd fd) (fun () -> Lwt_unix.close fd))

let timer n = Lwt.(
  (* Print the remaining time, sleep 10 seconds, repeat until n <= 0. *)
  let rec tick n () =
    if n <= 0 then return_unit
    else
      Lwt_io.printl (string_of_int n) >>= fun () ->
      Lwt_unix.sleep 10.0 >>=
      tick (n - 10)
  in
  tick n ())

Here test_file_read path reads the file at the given path and, when done, prints the total amount of data read. The timer function prints a tick every 10 seconds until the timer expires.

My expectation was that when I run
Lwt.join [test_file_read "/var/tmp/zeros"; timer 60]
I would always see first a message about the number of read bytes and only then the timer and its ticks would show up. However in utop the output was:

utop # Lwt.join [test_file_read "/var/tmp/zeros"; timer 60];;
Total read: 11718.750 MB
- : unit = ()

So the code reading the 11 GB file and the timer ran in parallel without any effort on my part. Why is that so?


As far as I know, reading from a file on Linux can block. Can you provide more information about this statement?

When Lwt has to perform an operation that can block, it runs the operation in a worker thread, to prevent it from potentially blocking the whole process. Each such operation gives an opportunity for Lwt to run some of your other code, such as the timer. Reads from ordinary files can block on Unix platforms, due (partly) to the lack of a non-blocking API for reading them.

Sockets and pipes have a non-blocking reading API, which Lwt uses. You could potentially starve the timer reading from a socket or pipe, though it’s probably not likely, since the writer on the other side would have to be writing as quickly as your process is able to read.

I was not precise with that statement. I mean that on Linux the read system call for files behaves as if there is always data available to read from the file descriptor, even if the kernel thread has to wait for a potentially long I/O, and even for files opened with O_NONBLOCK.

After more experimentation I see that I can trigger the starvation with the above code if I call Lwt_unix.set_default_async_method Lwt_unix.Async_none. So I really have to use yield, as >>= alone does not provide a scheduling point that runs other promises, right?
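For reference, a read loop with an explicit scheduling point might look like the sketch below. It is only an illustration of the idea being asked about: count_bytes and its pause_every parameter are hypothetical names, and Lwt.pause stands in for what the thread calls "yield". It assumes the lwt.unix library is linked.

```ocaml
open Lwt.Infix

(* Hypothetical helper: count the bytes readable from [fd], calling
   Lwt.pause after every [pause_every] chunks so that other promises
   (such as the timer) get a chance to run even if the reads themselves
   never return to the scheduler. *)
let count_bytes ?(pause_every = 16) fd =
  let buffer_size = 4096 in
  let buffer = Bytes.create buffer_size in
  let rec loop chunks total =
    Lwt_unix.read fd buffer 0 buffer_size >>= fun nread ->
    if nread = 0 then Lwt.return total
    else
      let continue () = loop (chunks + 1) (total + nread) in
      (* Explicit scheduling point once every [pause_every] chunks. *)
      if (chunks + 1) mod pause_every = 0 then Lwt.pause () >>= continue
      else continue ()
  in
  loop 0 0
```

Pausing only every few chunks is a design choice to keep the scheduling overhead small; the right interval depends on the workload.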

@fpoling, what you are describing sounds a lot like blocking. To get such a behavior, the user’s call to read will have to avoid returning until there is either at least one byte available, or the file is closed. O_NONBLOCK has no effect on regular files on Linux, AFAIK. It is only meaningful for sockets and pipes (and perhaps a few other things).

If you change the async method to none, IIRC you are causing Lwt to run the read call directly in the main thread, so it is now doing blocking reads without using a worker thread. This commits your main thread, which is running Lwt, to calling read, being forced by the system to wait if necessary, and running only at the speed that the system is able to return data to you. You shouldn’t do this, but yes, if you do, you should call yield manually. And yes, >>= does not currently provide a scheduling point for running other things, although that may change slightly in the future (https://github.com/ocsigen/lwt/issues/329). Actually, it won’t change in any way that is meaningful for this topic.
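The point about >>= can be seen without any file I/O. The following is a small sketch, not taken from the thread: busy, polite, note and run are made-up names. Two loops are joined; in the first, every promise is already resolved, so >>= runs the callback immediately and the first loop finishes before the second one starts. In the second, Lwt.pause gives the scheduler a chance to switch between them. It assumes lwt.unix is linked for Lwt_main.run.

```ocaml
open Lwt.Infix

let log = ref []
let note s = log := s :: !log

(* Every promise here is already resolved, so >>= calls the callback
   right away and the loop never hands control back to the scheduler. *)
let rec busy name n =
  if n = 0 then Lwt.return_unit
  else begin
    note name;
    Lwt.return_unit >>= fun () -> busy name (n - 1)
  end

(* Same loop, but with an explicit scheduling point per iteration. *)
let rec polite name n =
  if n = 0 then Lwt.return_unit
  else begin
    note name;
    Lwt.pause () >>= fun () -> polite name (n - 1)
  end

(* Start loop "a", then loop "b", wait for both, and return the
   sequence of names in the order they were noted. *)
let run f =
  log := [];
  let a = f "a" 3 in
  let b = f "b" 3 in
  Lwt_main.run (Lwt.join [ a; b ]);
  String.concat "" (List.rev !log)
```

Here run busy yields "aaabbb": the whole of loop "a" runs before "b" is even created. With run polite the two loops interleave instead; the exact order is up to the scheduler.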

"... and running only at the speed that the system is able to return data to you."

Due to read-ahead on Linux, there is a sweet spot for the size of the read buffer (typically around 8K-32K) at which, after the read call returns, the kernel keeps reading data from the disk to populate the cache while the user process is busy doing other things. With this setup the read system call just copies the data from the cache, which is very fast. This is very fragile and system/task dependent, and should only be used after extensive benchmarking. Still, it is nice that Async_none exists as a per-thread option, allowing performance to be tuned when necessary :slight_smile:

Fair enough. If you know well what your exact scenario is, then perhaps you can guarantee a very low probability of blocking :slight_smile: In general, though, there are extreme cases like NFS, where read can block for an unpredictable amount of time.

Besides Async_none, you can also mess with this on a per-fd basis. If you call

Lwt_unix.set_blocking ~set_flags:false fd false

it will tell Lwt that the file descriptor is in non-blocking mode, without Lwt making any effort to actually set O_NONBLOCK on it (not that it matters for regular files). This will cause Lwt to run I/O on that file descriptor directly in the main thread. I don’t claim this is a good idea, or future-proof – but it is there :slight_smile:
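A minimal sketch of how that call might be used is below. open_for_direct_reads is a hypothetical helper name, and the sketch assumes lwt.unix is linked; it only shows where the call goes and carries the same caveats as above.

```ocaml
open Lwt.Infix

(* Hypothetical helper: open [path] read-only and tell Lwt to treat the
   descriptor as non-blocking. With ~set_flags:false, Lwt only updates
   its own bookkeeping and does not touch O_NONBLOCK on the descriptor. *)
let open_for_direct_reads path =
  Lwt_unix.openfile path [ Unix.O_RDONLY; Unix.O_CLOEXEC ] 0 >>= fun fd ->
  Lwt_unix.set_blocking ~set_flags:false fd false;
  Lwt.return fd
```

After this, Lwt_unix.blocking fd should report false, and reads on fd are attempted directly in the main thread, so a loop over a large file would need explicit scheduling points to let other promises run.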