Buffered I/O and performance

I’ve been experimenting a bit with buffering done in OCaml versus the built-in buffering of in_channel, prompted by this discussion.

Roughly, it gives this:

### buffered
hyperfine --warmup 2 './run buf f_500M'
Benchmark 1: ./run buf f_500M
  Time (mean ± σ):      94.9 ms ±   5.9 ms    [User: 15.5 ms, System: 79.4 ms]
  Range (min … max):    88.9 ms … 107.5 ms    28 runs

### unix
hyperfine --warmup 2 './run unix f_500M'
Benchmark 1: ./run unix f_500M
  Time (mean ± σ):      96.2 ms ±   4.6 ms    [User: 18.3 ms, System: 77.8 ms]
  Range (min … max):    88.0 ms … 104.5 ms    28 runs

### by char
hyperfine --warmup 2 './run char f_500M'
Benchmark 1: ./run char f_500M
  Time (mean ± σ):      3.148 s ±  0.033 s    [User: 3.084 s, System: 0.061 s]
  Range (min … max):    3.095 s …  3.204 s    10 runs

where “buf” uses the standard in_channel, “unix” uses a Unix.file_descr + a local buffer, and “char” uses in_channel but reads char by char with input_char. It’s a bit surprising that the char-by-char version is that much slower.
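
For reference, the three variants boil down to loops roughly like these (a simplified, untested sketch with hypothetical names, not the exact benchmarked code, which is linked below; 64 KB buffer assumed):

```ocaml
let buf_size = 65536 (* assumed: 64 KB, matching the benchmarks *)

(* "buf": stdlib in_channel, reading blocks into an OCaml bytes buffer *)
let read_all_buf path =
  let ic = open_in_bin path in
  let buf = Bytes.create buf_size in
  let total = ref 0 in
  let rec loop () =
    let n = input ic buf 0 buf_size in
    if n > 0 then (total := !total + n; loop ())
  in
  loop ();
  close_in ic;
  !total

(* "unix": a raw Unix.file_descr plus a locally managed buffer *)
let read_all_unix path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  let buf = Bytes.create buf_size in
  let total = ref 0 in
  let rec loop () =
    let n = Unix.read fd buf 0 buf_size in
    if n > 0 then (total := !total + n; loop ())
  in
  loop ();
  Unix.close fd;
  !total

(* "char": in_channel, one character at a time with input_char *)
let read_all_char path =
  let ic = open_in_bin path in
  let total = ref 0 in
  (try
     while true do
       ignore (input_char ic);
       incr total
     done
   with End_of_file -> ());
  close_in ic;
  !total
```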

What I find surprising is that in_channel + input is faster than using a Unix fd and a single OCaml buffer. I thought it would be interesting to maybe remove some C from the runtime some day, but apparently even double buffering (the channel’s internal buffer plus my own) doesn’t hurt performance. Can anyone comment on why “buf” is slightly faster than “unix”?

Code and results are here:

Nice, thanks for the experiment! What is the size of the OCaml buffer during benchmarking? Is it set using BUFSIZE, or is it the hardcoded 64K?

Cheers,
Nicolas

Just a wild guess here, but perhaps it is because, on the C side, “buf” uses a single buffer that gets allocated once when you open the channel, while “unix” allocates a fresh buffer for each call to read (on the stack, but still…).

Cheers,
Nicolas

How do you guess mmap would fit into the picture?
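
Something like this, perhaps (a rough, untested sketch using Unix.map_file, which maps the file into a Bigarray backed by the mapped pages):

```ocaml
(* Rough sketch, untested: scan a file via mmap. Unix.map_file
   (OCaml >= 4.06) returns a Bigarray whose data lives in the
   mapped region, outside the OCaml heap. *)
let sum_bytes_mmap path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  let len = (Unix.fstat fd).Unix.st_size in
  let arr =
    Bigarray.array1_of_genarray
      (Unix.map_file fd Bigarray.char Bigarray.c_layout false [| len |])
  in
  Unix.close fd;
  (* Touch every byte so the pages actually get faulted in. *)
  let total = ref 0 in
  for i = 0 to len - 1 do
    total := !total + Char.code (Bigarray.Array1.get arr i)
  done;
  !total
```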

“unix” looks about the same performance as “buffered”, no? The range is better; the mean is slightly worse, but perhaps the in_channel avoids the odd syscall or two, depending on buffer size, etc.

BTW thanks for doing this! Interesting results.

I first tried with a 4 KB buffer, but strace showed that in_channel was using 64 KB, so I standardized on that. Everything runs with a 64 KB buffer now.

They do look like the same performance, but my intuition was that unix should have been faster. @rgrinberg pointed out on IRC, like @nojb here, that Unix.read allocates a C buffer anyway, whereas I was expecting it to write directly into the passed bytes. Apparently the reason is that Unix.read releases the runtime lock around the syscall, so it cannot count on the bytes buffer staying at the same place in memory (the GC may move it while the lock is released), and it has to copy through an intermediate buffer instead.

Ah! I didn’t realise. Thanks for explaining. So perhaps a version of Unix.read with a bigstring buffer, which the GC never moves, would show the speedup (Jane St. libs?).

I think that you will see a big speedup if you recode “by char” to use Unix.read + a buffer.
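
Something along these lines, perhaps (a rough, untested sketch with hypothetical names; 64 KB buffer as in the other variants):

```ocaml
(* Rough sketch, untested: char-by-char reading layered over
   Unix.read and a manually refilled buffer. *)
let buf_size = 65536

type reader = {
  fd : Unix.file_descr;
  buf : Bytes.t;
  mutable pos : int; (* index of the next unread byte *)
  mutable len : int; (* number of valid bytes in buf *)
}

let make fd = { fd; buf = Bytes.create buf_size; pos = 0; len = 0 }

(* Returns None at end of file. *)
let next_char r =
  if r.pos >= r.len then begin
    r.len <- Unix.read r.fd r.buf 0 buf_size;
    r.pos <- 0
  end;
  if r.len = 0 then None
  else begin
    let c = Bytes.get r.buf r.pos in
    r.pos <- r.pos + 1;
    Some c
  end
```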

Cheers,
Nicolas

This is an issue I’ve also had when trying to write efficient C primitives for OCaml. Being used to .NET, I expected the OCaml runtime to have an interface to pin/unpin blocks so they can’t be moved by the GC, which would save copies back and forth. (Blocks that stay pinned for “too long” should probably be promoted to the major heap beforehand.)

Would it be feasible to add block pinning to the OCaml runtime (once the feature-freeze is over)?

(2 little cents)

A decade ago we did some buffer-size vs. filesystem benchmarks with @ashish (the optimal buffer size varied wildly with the underlying filesystem, but we were also trying distributed ones).

But we also found that https://ocsigen.org/lwt/5.5.0/api/Lwt_io provided the best flexibility for those benchmarks; see the ?buffer:Lwt_bytes.t arguments. It might still be worth a try :slight_smile:
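
For example, a rough sketch (untested; using Lwt_io.read_into and the ?buffer argument, with Lwt.Syntax for let*):

```ocaml
(* Rough sketch, untested: an Lwt_io input channel with an
   explicitly sized Lwt_bytes buffer passed via ?buffer. *)
let read_all_lwt path =
  let open Lwt.Syntax in
  let buffer = Lwt_bytes.create 65536 in
  let* ic = Lwt_io.open_file ~buffer ~mode:Lwt_io.input path in
  let chunk = Bytes.create 65536 in
  let rec loop total =
    let* n = Lwt_io.read_into ic chunk 0 65536 in
    if n = 0 then Lwt.return total else loop (total + n)
  in
  let* total = loop 0 in
  let* () = Lwt_io.close ic in
  Lwt.return total
```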
