I’ve been experimenting a bit with buffering done in OCaml, versus the built-in buffering of in_channel, prompted by this discussion.
Roughly, it gives this:
### buffered
```
hyperfine --warmup 2 './run buf f_500M'
Benchmark 1: ./run buf f_500M
  Time (mean ± σ):      94.9 ms ±  5.9 ms    [User: 15.5 ms, System: 79.4 ms]
  Range (min … max):    88.9 ms … 107.5 ms    28 runs
```
### unix
```
hyperfine --warmup 2 './run unix f_500M'
Benchmark 1: ./run unix f_500M
  Time (mean ± σ):      96.2 ms ±  4.6 ms    [User: 18.3 ms, System: 77.8 ms]
  Range (min … max):    88.0 ms … 104.5 ms    28 runs
```
### by char
```
hyperfine --warmup 2 './run char f_500M'
Benchmark 1: ./run char f_500M
  Time (mean ± σ):      3.148 s ±  0.033 s    [User: 3.084 s, System: 0.061 s]
  Range (min … max):    3.095 s …  3.204 s    10 runs
```
where “buf” uses the standard in_channel, “unix” uses a Unix.file_descr plus a local buffer, and “char” uses in_channel but reads char by char with input_char. It’s a bit surprising that the last one is that much slower.
What I find surprising is that in_channel + input is faster than using a Unix fd and a single OCaml buffer. I thought it might be interesting to remove some C from the runtime some day, but apparently even double buffering doesn’t hurt performance. Can anyone comment on why “buf” is slightly faster than “unix”?
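For reference, the three variants could be sketched roughly like this (not the actual benchmark code; the function names and the byte-counting bodies are made up for illustration, using the 64 KB buffer size discussed below):

```ocaml
let buf_size = 65536

(* "buf": standard in_channel + input into an OCaml buffer *)
let read_buf path =
  let ic = open_in_bin path in
  let buf = Bytes.create buf_size in
  let total = ref 0 in
  let rec loop () =
    let n = input ic buf 0 buf_size in   (* returns 0 at end of file *)
    if n > 0 then (total := !total + n; loop ())
  in
  loop ();
  close_in ic;
  !total

(* "unix": Unix.file_descr + a local buffer, no channel *)
let read_unix path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  let buf = Bytes.create buf_size in
  let total = ref 0 in
  let rec loop () =
    let n = Unix.read fd buf 0 buf_size in
    if n > 0 then (total := !total + n; loop ())
  in
  loop ();
  Unix.close fd;
  !total

(* "char": in_channel, one character at a time *)
let read_char path =
  let ic = open_in_bin path in
  let total = ref 0 in
  (try
     while true do
       ignore (input_char ic);
       incr total
     done
   with End_of_file -> ());
  close_in ic;
  !total
```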
Just a wild guess here, but perhaps it is due to the fact that on the C side, “buf” uses a single buffer that gets allocated once when you open the channel, while “unix” allocates a different buffer for each call to read (on the stack, but still…)
“unix” looks about the same performance as “buffered”, no? The range is better; the mean is slightly worse, but perhaps the in_channel avoids the odd syscall or two depending on buffer size, etc.
I first tried a 4 KB buffer, but strace showed that in_channel was using 64 KB, so I uniformized: everything now runs with a 64 KB buffer.
They do look to be the same performance, but my intuition was that “unix” should have been faster. @rgrinberg pointed out on IRC, like @nojb here, that Unix.read allocates a C buffer anyway, whereas I expected it to write directly into the bytes passed in. Apparently the reason is that Unix.read releases the runtime lock, so it can’t count on the bytes buffer staying at the same place in memory while the read is in progress (the GC may move it), and it copies through a stack-allocated C buffer instead.
This is an issue I’ve also run into when trying to write efficient C primitives for OCaml. Being used to .NET, I expected the OCaml runtime to have an interface to pin/unpin blocks so they can’t be moved by the GC, which would save copies back and forth, although blocks that stay pinned for “too long” should probably be promoted to the major heap beforehand.
Would it be feasible to add block pinning to the OCaml runtime (once the feature-freeze is over)?
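For what it’s worth, Bigarrays already behave somewhat like pinned memory: their data lives outside the OCaml heap, so the GC never moves it and C code can hold a pointer across a lock release. A minimal sketch using the stdlib’s Unix.map_file, which exposes a file as a Bigarray with no copying (sum_mapped is a made-up name for illustration):

```ocaml
(* Sum the bytes of a file through a zero-copy memory-mapped Bigarray. *)
let sum_mapped path =
  let fd = Unix.openfile path [ Unix.O_RDONLY ] 0 in
  let len = (Unix.fstat fd).Unix.st_size in
  (* Map the whole file; the resulting Bigarray's storage is outside the
     OCaml heap, so it is never moved by the GC. *)
  let arr =
    Bigarray.array1_of_genarray
      (Unix.map_file fd Bigarray.char Bigarray.c_layout false [| len |])
  in
  Unix.close fd;
  let total = ref 0 in
  for i = 0 to len - 1 do
    total := !total + Char.code (Bigarray.Array1.get arr i)
  done;
  !total
```

The flip side is that Bigarray accesses from OCaml are a bit slower than Bytes accesses, which is presumably part of the trade-off a real pinning API would have to address.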
A decade ago we did some buffer-size vs. filesystem benchmarks with @ashish (the optimal buffer size varied wildly with the underlying filesystem, though we were also trying distributed ones).
But we also found that https://ocsigen.org/lwt/5.5.0/api/Lwt_io provided the best flexibility for those benchmarks → see the ?buffer:Lwt_bytes.t arguments; it might still be worth a try.