Load Balancer on FPGA - a Hardcaml Project


My name is Jakub and I am a CS student. I would like to share a Hardcaml project that I have been working on for the past few months: a network load balancer. It has no practical use, but I decided to publish it because of its (hopefully) educational value.

Link to the project: https://github.com/xThaid/fpga-lb

When I was learning Hardcaml, the biggest barrier for me was the lack of existing projects. It would have been extremely useful to study such a project to pick up techniques and tricks, or to see what a big project should look like in general. Therefore, I am posting this in the hope that somebody will find it useful. Hardcaml seems to be a really interesting alternative to other hardware description languages, and I wish it had broader recognition.

Brief documentation of the architecture and more comments about the project are available in the repo.

Please let me know if you have any comments!




This is very cool.

Interested in what sort of overall latency, including the MAC/PCS layer, you see. Also, could you see a way to increase the throughput of the data plane to over 10 Gbit/s?

I am glad you liked it.

The theoretical overall latency (calculated from information in the datasheets of the MAC [0] and the PHY [1], because I didn't have the equipment to perform measurements) is:

  • PHY RX – 200 ns
  • MAC RX – 102 cycles @ 100 MHz
  • The load balancer – 30 cycles @ 100 MHz
  • MAC TX – 40 cycles @ 100 MHz
  • PHY TX – 80 ns

In total, ~2 microseconds.
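To sanity-check that total, here is a small OCaml sketch that converts the cycle counts above to nanoseconds at the 100 MHz clock (all figures are taken from the list; nothing else is assumed):

```ocaml
(* Back-of-the-envelope check of the latency breakdown above.
   At 100 MHz, one clock cycle takes 10 ns. *)
let ns_per_cycle = 10.0

let total_latency_ns =
  let phy_rx = 200.0 in                 (* PHY RX, from datasheet *)
  let mac_rx = 102.0 *. ns_per_cycle in (* MAC RX, 102 cycles *)
  let lb     = 30.0  *. ns_per_cycle in (* load balancer, 30 cycles *)
  let mac_tx = 40.0  *. ns_per_cycle in (* MAC TX, 40 cycles *)
  let phy_tx = 80.0 in                  (* PHY TX, from datasheet *)
  phy_rx +. mac_rx +. lb +. mac_tx +. phy_tx

let () = Printf.printf "total latency: %.0f ns\n" total_latency_ns
(* prints "total latency: 2000 ns", i.e. ~2 microseconds *)
```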

The throughput is currently limited by the 1 Gbit PHY and MAC components. However, assuming they don't impose any limits, the throughput would reach about 4 Gbit/s. I haven't tested this, but I think I could easily increase the width of the data path from 32 bits to 128 bits (or even beyond). At the cost of a few extra cycles, I could also cut some data paths to raise the clock frequency to 150 MHz. That gives ~20 Gbit/s.
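The arithmetic behind these estimates is just data-path width times clock frequency. The helper below is a hypothetical sketch using only the figures from this paragraph (it ignores any framing or protocol overhead, which is why the raw product can differ a little from the rounded numbers above):

```ocaml
(* Raw data-path throughput in Gbit/s: width (bits) x frequency (MHz).
   This ignores framing/protocol overhead; purely illustrative. *)
let throughput_gbps ~width_bits ~freq_mhz =
  float_of_int width_bits *. float_of_int freq_mhz /. 1000.0

(* Current design: 32-bit path at 100 MHz, ~3 Gbit/s. *)
let current = throughput_gbps ~width_bits:32 ~freq_mhz:100

(* Widened design: 128-bit path at 150 MHz, ~20 Gbit/s. *)
let widened = throughput_gbps ~width_bits:128 ~freq_mhz:150
```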

There is also a natural way to scale the load balancer horizontally. Since the processing of a single packet is context-free, many data planes could be put on a single FPGA and the load distributed among the processing units, multiplying the throughput by the number of such units. On the FPGA I had (a rather low-end one), there would be room for 4 balancer+MAC sets.

But these are only theoretical calculations, and I may not be aware of some limitations that would cap the throughput at a lower number.

[0] Triple-Speed Ethernet Intel® FPGA IP
[1] MARVELL Alaska 88E1111 Datasheet