My name is Jakub and I am a CS student. I would like to share a Hardcaml project that I have been working on for the past few months: a network load balancer. It has no practical usage, but I decided to publish it for its (hopefully) educational value.
When I was learning Hardcaml, the biggest barrier for me was the lack of existing projects. It would have been extremely useful to look at one to pick up techniques and tricks, or to see what a big project should look like in general. So I am posting this in the hope that somebody will find it useful. Hardcaml seems to be a really interesting alternative to other hardware description languages, and I wish it had broader recognition.
Brief documentation of the architecture and more comments about the project are available in the repo.
The theoretical overall latency (calculated from the MAC and PHY datasheets, because I didn't have the equipment to perform measurements) is:
PHY RX – 200 ns
MAC RX – 102 cycles @ 100 MHz
The load balancer – 30 cycles @ 100 MHz
MAC TX – 40 cycles @ 100 MHz
PHY TX – 80 ns
In total: ~2 microseconds.
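The arithmetic behind the ~2 µs figure can be sanity-checked with a quick script, assuming 10 ns per cycle at 100 MHz (the stage names mirror the list above):

```python
# Latency budget check: cycle counts converted at 100 MHz (10 ns per cycle).
CLOCK_NS = 10  # 1 / 100 MHz = 10 ns

stages_ns = {
    "PHY RX": 200,                  # fixed delay from the PHY datasheet
    "MAC RX": 102 * CLOCK_NS,       # 102 cycles
    "load balancer": 30 * CLOCK_NS, # 30 cycles
    "MAC TX": 40 * CLOCK_NS,        # 40 cycles
    "PHY TX": 80,                   # fixed delay from the PHY datasheet
}

total_ns = sum(stages_ns.values())
print(total_ns)  # 2000 ns, i.e. ~2 microseconds
```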
The throughput is currently limited by the 1 Gbit PHY and MAC components. However, assuming they imposed no limits, the throughput would reach about 4 Gbit/s. I haven't tested it, but I think I could easily widen the data path from 32 bits to 128 bits (or even beyond). At the cost of a few cycles I could also cut some long data paths to raise the clock frequency to 150 MHz. That gives ~20 Gbit/s.
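These estimates come from data-path width times clock frequency. A quick sketch of that arithmetic (the helper name is mine, not from the project; the raw figures round up to the ~4 and ~20 Gbit/s quoted above):

```python
# Peak throughput estimate: one data-path word per clock cycle.
def throughput_gbps(width_bits: int, clock_mhz: int) -> float:
    """Width in bits times clock in MHz gives Mbit/s; divide for Gbit/s."""
    return width_bits * clock_mhz / 1000

print(throughput_gbps(32, 100))   # 3.2  -> the "~4 Gbit/s" estimate
print(throughput_gbps(128, 150))  # 19.2 -> the "~20 Gbit/s" estimate
```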
There is also a natural way to scale the load balancer horizontally. Since the processing of a single packet is context-free, many dataplanes could be placed on a single FPGA and the load distributed among them, multiplying the throughput by the number of processing units. On the FPGA I had (a rather low-end one), there would be room for 4 balancer+MAC sets.
But these are only theoretical calculations, and I may not be aware of some limitation that caps the throughput at a lower number.