Load Balancer on FPGA - a Hardcaml Project

Hi,

My name is Jakub and I am a CS student. I would like to share with you a Hardcaml project that I have been working on for the past few months: a network load balancer. It has no practical use, but I decided to publish it because of its (hopefully) educational value.

Link to the project: https://github.com/xThaid/fpga-lb

When I was learning Hardcaml, the biggest barrier for me was the lack of existing projects. It would have been extremely useful to look at such a project to pick up techniques and tricks, or to see what a big project should look like in general. Therefore, I am posting this in the hope that somebody will find it useful. Hardcaml seems to be a really interesting alternative to other hardware description languages, and I wish it had broader recognition.

Brief documentation of the architecture and more comments about the project are available in the repo.

Please let me know if you have any comments!

Thanks,
Jakub


@jurb thank you very much for sharing; reports like yours are extremely valuable.

And if you like stickers, get some OCaml Stickers via 🐌 … going on!

This is very cool.

I'm interested in what sort of overall latency you see, including the MAC/PCS layer. Also, could you see a way to increase the throughput of the data plane to over 10 Gbps?

I am glad you liked it.

The theoretical overall latency (calculated from the MAC [0] and PHY [1] datasheets, because I didn’t have the equipment to perform measurements) is:

  • PHY RX – 200 ns
  • MAC RX – 102 cycles @ 100 MHz
  • The load balancer – 30 cycles @ 100 MHz
  • MAC TX – 40 cycles @ 100 MHz
  • PHY TX – 80 ns

In total, ~2 microseconds.
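
Spelling that sum out (my own restatement, assuming one cycle at 100 MHz is 10 ns):

200 ns + 102 × 10 ns + 30 × 10 ns + 40 × 10 ns + 80 ns = 2000 ns ≈ 2 µs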

The throughput is currently limited by the 1 Gbit PHY and MAC components. However, assuming they didn’t impose any limit, the throughput would reach about 4 Gbit/s. I haven’t tested this, but I think I could easily increase the width of the data path from 32 bits to 128 bits (or even beyond). At the cost of a few extra cycles of latency, I could also cut some long paths to increase the clock frequency to 150 MHz. That gives ~20 Gbit/s.
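
As a quick sanity check of that last figure (my own arithmetic, ignoring framing and protocol overhead), the raw data-path bandwidth is just bus width × clock frequency:

128 bits × 150 MHz = 19.2 Gbit/s ≈ 20 Gbit/s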

There is also a natural way to scale the load balancer horizontally. Since the processing of a single packet is context-free, many dataplanes could be put on a single FPGA and the load could be distributed among the processing units, which would multiply the throughput by the number of such units (a rough sketch of the idea is below). On the FPGA I had (a rather low-end one), there would be room for 4 balancer+MAC sets.
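
To make the fan-out idea a bit more concrete, here is a minimal, purely illustrative OCaml sketch (not code from the repo; `process_packet` and `flow_hash` are hypothetical stand-ins for a single dataplane instance and a flow hash):

```ocaml
(* Purely illustrative sketch, not code from the repo: because each
   packet is processed without shared state, a packet can be steered
   to any of [n_units] identical dataplane instances, e.g. by a hash
   of its flow, so aggregate throughput scales with [n_units]. *)
let steer ~n_units ~flow_hash ~process_packet packet =
  let unit_index = flow_hash packet mod n_units in
  process_packet ~unit_index packet
```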

But these are only theoretical calculations, and I may not be aware of some limitation that caps the throughput at a lower number.

[0] Triple-Speed Ethernet Intel® FPGA IP
[1] Marvell Alaska 88E1111 Datasheet
