Fast data processing using pipes and Lwt_stream

Hi, I’m writing a progam for processing real time data. I’m targeting 30 MB/s in real time over a prolonged period (not sure if feasible). It’s going to do pretty simple analysis of the stream of data but I’m not even there yet. The data is going to come from dumpcap. The simplest approach I could come up with is using named pipes so that I will use something like dumpcap -i any -o /tmp/target udp (where /tmp/target is a named pipe/fifo).

Now, here’s a naive implementation of my “reader” program (counter is not a real example):

let%bind () = Lwt_unix.mkfifo "/tmp/target" 0o700 in
let%bind in_channel = Lwt_io.open_file ~flags:[Unix.O_RDONLY] ~mode:Input "/tmp/target" in 
let stream = Lwt_io.read_chars in_channel in
let counter = ref(0) in 
let%map () = Lwt_stream.iter (fun _ -> incr counter) stream in
print_int !counter;

… and it’s painfully slow. I understand it’s all context and hardware dependent but piping 10 MB from /dev/zero to the program on an M1 Mac takes around 1 second (while head -c 10000000 /dev/zero > /dev/null takes <0.01s). Where should I look for profiling and which techniques should I learn about? How to get some intuition when it comes to expectations about performance in such cases?

Since you seem to be on macOS. Fire up choose “Time profiler” and run your executable with it. You can then get a feel of the heavy functions which will be easy to recognize from the reasonably evident name mangling scheme.

The UI is confusing, all the documentation is outdated, but it’s possible to get to it. Also at some point I managed to run the sampling directly from the cli, but a few week ago I didn’t succeed.

Don’t use abstractions before you need them :–) I don’t know your larger context but why don’t you simply first try with a bare Unix program to see the kind of throughput you can get.


I’ll second @dbuenzli and say that you may not need lwt or anything fancy.

I wrote a tiny test case: small IO benchmark · GitHub

and running it on my laptop gives me:

read 500000000 bytes from "/dev/zero"
input contained 500000000 zeroes
took 3.3998s (147.0675 MB/s)

There is no fifo involved, so perhaps it’d be slower overall, but that gives you a ballpark for just reading from a in_channel into a fixed-size buffer (64 KiB).

edit: to measure perf on linux, I would use something like this:

$ strace -c ./_build/default/bin/main.exe /dev/zero
read 500000000 bytes from "/dev/zero"
input contained 500000000 zeroes
took 1.6951s (294.9611 MB/s)
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.004008           0      7632           read
  0.00    0.000000           0         2           write
  0.00    0.000000           0         4           close
  0.00    0.000000           0         4         3 lseek
  0.00    0.000000           0        15           mmap
  0.00    0.000000           0         5           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         5           brk
  0.00    0.000000           0         3           rt_sigaction
  0.00    0.000000           0         4           pread64
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           readlink
  0.00    0.000000           0         3           sigaltstack
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         4           openat
  0.00    0.000000           0         4           newfstatat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ----------------
100.00    0.004008           0      7696         5 total

This measures system calls, which are critical for IO (I’m sure macOS has some equivalent).


As others have said, get rid of Lwt and see where you can get.

To give a bit more context on Lwt and why you probably shouldn’t be using it in your application:
In essence, Lwt simply makes sure that your program keeps busy whilst you are waiting for I/O.

As such it is primarily useful for programs that need concurrency and especially that do potentially a lot of concurrent I/O calls. Think:

  • a webserver which needs to keep answering client calls whilst waiting for some database lookup to return, or
  • a p2p node which sends and receive gossip messages whilst exchanging application-level data with a given peer.

This doesn’t seem to be the kind of program you are writing. And if you have performance constraints, then, as mentioned by others, you should get rid of abstractions you don’t need.


To be fair, the speedups here are mostly because of switching from processing 1 character at a time to processing a chunk of bytes at a time. Using Lwt in your example and processing 64kb chunks a time gives me more or less the same throughput, and switching your example to processing 1 character at a time as the original post slows it down significantly :slightly_smiling_face:

Edit: Not saying that Lwt should be used for this kind of problem since I’m not sure it’ll benefit much here, but using Lwt for this problem shouldn’t be that much slower (if at all) than using the Unix library. Buffering strategies, and processing strategies (processing 1 item at a time vs chunks) will most likely have a lot more impact than lwt vs plain unix.


I have to bite here. Where does this come from? It doesn’t exist on my Big Sur with Xcode CLI tools.

I removed Xcode itself because so far all it seemed to be good for was gobbling bandwidth with its huge updates.


I’m realizing it now about the buffer size. That’s what I was alluding to wrt getting some intuition. Going abstraction less was very helpful tip. What I pasted is obviously just a subset of a larger program :slightly_smiling_face:.

I was using Lwt for unrelated reasons but I might change the architecture a bit and skip the requirement for concurrency.

I completely forgot about the fact that using standard profiling tools is a viable way to profile OCaml and I would definitely see the bottlenecks in this case. FWIW is a part of the Xcode distribution (not CLI, you have to download Xcode itself). It’s a pretty decent set of tools, I used to use it extensively when I worked on macOS software in the past.

Note that there used to be an instruments cli tool but it seems to have vanished (at least on my machine). Also you still need the app to look at the traces.

I wish firefox’s profiling console could load them – apparently it knows how to read perfs profiles.

I don’t remember how I found this, but there is an xctrace tool available on macOS that can be used to generate the trace files using the cli.

xcrun xctrace record --template 'Time Profile' --launch -- <your app>

The list of templates can be seen via xctrace list templates, and the manpage has a lot more details about the various cli options available for the tool.


Not in mine, sadly. It’s probably installed as part of Xcode as well.

To finish the topic off: we were able to make the app really fast. We did not scientifically benchmark it on our hardware because it became IO bound in simple cases like cat file.pcap | our_program. On my machine in particular it maxed out at 1.5 GB/s including processing and analysis of the pcap packets.

It gave me quite a bit of intuition when it comes to the overall performance. As @anuragsoni mentioned, the main boost comes from using larger buffers.

Although besides the initial scope of the post, we were able to avoid copying thanks to Unix Bigstring modules from Core (which includes facilities for reading from a fd directly into a bigstring, the underlying implementation is very simple though) and then interfacing with this data using Cstructs and then parsing the interesting bits with angstrom. Together it’s a very powerful combination.

If there’s one takeaway from this thread it’s that it’s not too hard to build very performant applications in OCaml and if you ever get stuck, just try to establish the baseline first. I used higher level utilities and there were too many variables involved. Instead I should have tried to establish the baseline. In my case it’d have been a simple program that reads from stdin and just reads data into a buffer.