I just wanted to share a fun result from a project I’ve been hacking on. ocaml-xsk is a binding to the AF_XDP interface of libbpf.
AF_XDP is an address family in Linux for high-performance packet processing. With an AF_XDP socket, a packet bypasses most of the kernel networking stack and is passed directly to a userspace program. Depending on the configuration, packets can be passed to or from the NIC without any data copies on either Rx or Tx. If you’re interested in this kind of stuff, here are a couple of very useful resources:
The cool part is that without installing large dependencies like DPDK you can get packets into your program basically as fast as your NIC can provide them! It turns out this is true even if your program is written in OCaml. Using ocaml-xsk I could receive or transmit 64-byte UDP packets at 14.8M packets per second, the line-rate limit for a 10 Gb/s NIC.
I’m still trying to figure out the best interface for AF_XDP. There are several resources to manage, and simple receive and transmit operations actually require a few steps. But it’s encouraging to know that OCaml doesn’t get in the way of packet throughput.
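To give a feel for those “few steps”, here is a rough sketch of an AF_XDP receive loop. The `Xsk` module and every function in it are hypothetical stand-ins, not the actual ocaml-xsk API; only the sequence of ring operations matters:

```ocaml
(* Hypothetical sketch of an AF_XDP receive loop.  Before entering
   the loop, free umem frames are handed to the kernel via the fill
   queue so it has somewhere to DMA incoming packets.  Then:
     1. consume descriptors from the rx ring,
     2. read each packet out of umem (zero-copy),
     3. recycle the frame back to the fill queue. *)
let rec rx_loop sock umem ~handle_packet =
  (* 1. Each descriptor is an (addr, len) pair into umem. *)
  let descs = Xsk.rx_consume sock ~max:64 in
  List.iter
    (fun (addr, len) ->
      (* 2. Zero-copy view of the received frame. *)
      let pkt = Xsk.umem_slice umem ~addr ~len in
      handle_packet pkt;
      (* 3. Give the frame back so the kernel can reuse it. *)
      Xsk.fill_produce sock [ addr ])
    descs;
  rx_loop sock umem ~handle_packet
```

Transmit is symmetric: produce descriptors onto the tx ring, then reclaim completed frames from the completion queue.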
Mark Hayden wrote a PhD thesis back in 1997 about how to use OCaml (well, at the time, Caml Light) to write high-performance network drivers. I can only say good things about his system, Ensemble. I used it about 10 years ago (maybe 2010? 2011?) to write Infiniband (DDR and RoCEE) programs against ibverbs, with similar performance to what you got: basically, the network hardware was the limiting factor.
Yep. That’s the one. If you have difficulty finding the code, I can upload the newest copy I have (which is still pretty old). When I was working with IB, I literally ported the -lowest- layer of his code onto ibverbs UD (“Unreliable Datagram”) and the rest of Ensemble just worked. Since ibverbs involves pinning memory, I created an O-O abstraction for the “iovec” management code, so that I could substitute in ibverbs-managed pinned memory instead of malloc-ed memory.
Sadly, those changes are lost in time, b/c owned by IBM, ah well. But it was really sweet, and Ensemble gives you full virtual synchrony in addition to various other communication modes.
ETA: Upon reflection, it seems like one might ask why I used objects instead of modules to provide the abstractions for substituting in IB-managed pinned-memory instead of malloc-ed memory. At the time, I was unsure of precisely how wrapping all of Ensemble in a functor would affect codegen, and since the code-paths for the lowest level of iovec management were all rarely-used, I thought it best to not disturb the upper layers of Ensemble.
Sure thing, though I’m afraid I’m not using these sockets for anything concrete at the moment, just messing around with them and XDP to get a feel for it. An AF_XDP socket + userspace TCP stack would be super cool, though I think I’ll leave that to someone else. However, reading through your implementation I’m considering attempting a port to Rust. I shall let you know if anything comes of that, and thanks again!
If you could share the Ensemble source I’d appreciate it. I tried to dig it up from the Wayback Machine archive of the old Cornell site. It looks like the actual download links don’t work.
I’m mostly curious about the event and message passing architecture they used for passing data and control information between protocol layers. I imagine you’d want to minimize the memory allocation for new events, since they happen with each new message. I wonder how they managed that. Seems like the iovec structure might be analogous to what Jane Street built with their Iobuf module. And one could build zero-allocation event tuples on top of something like Jane Street’s Tuple_pool module.
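For illustration, one way to avoid allocating a fresh event per message (my own sketch of the general idea, not Ensemble’s actual scheme) is a free list of mutable event records that get reused instead of garbage-collected:

```ocaml
(* A minimal free-list pool of mutable event records: instead of
   allocating a fresh event for every message, a layer acquires one
   from the pool and releases it when done.  Sketch only -- the
   fields are placeholders, not Ensemble's event type. *)
type event = {
  mutable kind : int;             (* stand-in for an event-type tag *)
  mutable len : int;              (* payload length *)
  mutable next : event option;    (* free-list link *)
}

let free_list = ref None

let acquire () =
  match !free_list with
  | Some e ->
      free_list := e.next;
      e.next <- None;
      e
  | None -> { kind = 0; len = 0; next = None } (* pool empty: allocate *)

let release e =
  e.next <- !free_list;
  free_list := Some e

let () =
  let e = acquire () in
  e.kind <- 1;
  e.len <- 64;
  release e;
  (* The next acquire reuses the same record: no new allocation. *)
  let e' = acquire () in
  assert (e == e')
```

In steady state the pool reaches a fixed size and the hot path allocates nothing, which keeps GC pressure off the per-message code.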
Anyway, if you have the source I’d be much obliged.
Sure, I’ll have a look and put it up on github. In the meantime, for many of your questions, it might be a good thing to read his PhD thesis (or his Ensemble paper). I think the thesis is actually really great, b/c it carefully explains why he chose the memory-management scheme he chose.
BTW, Ensemble is a descendant of Horus, which itself descends from ISIS. Those systems all supported full virtual synchrony and more. Which is … a whole lot more than TCP. So you’ll find that there’s a lot of complexity to the messaging model that you aren’t expecting … because it’s designed to do a whole lot more.
It’s the sort of toolkit you’d want, if you were going to implement Google’s Chubby.
Very cool! If you’re interested in hooking a full TCP/IP stack to it, then you can just implement the mirage-net module signature:
This would wrap around your XDP bindings and use Lwt for the synchronisation layer. At that point, you can use the mirage-tcpip library with your Netif_xdp to send and receive TCP/IP. Cohttp can work directly with the TCP/IP library to transmit HTTP, and the ocaml-tls stack supports it all as well.
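A rough sketch of the two core operations such a Netif_xdp would provide. The `Xdp` module is a hypothetical stand-in for the AF_XDP bindings, and the signature details are approximate; check the installed mirage-net for the exact `Mirage_net.S` interface:

```ocaml
(* Sketch of the two core Mirage_net.S-style operations over AF_XDP.
   [Xdp] and all of its functions are hypothetical; the shape of
   [write] and [listen] is approximate. *)
open Lwt.Infix

type t = { sock : Xdp.socket; mac : Macaddr.t }

(* Transmit: the caller fills a frame buffer in place, then we push
   it onto the tx ring. *)
let write t ~size fill =
  let buf = Xdp.alloc_tx_frame t.sock ~size in
  let len = fill buf in
  Xdp.submit_tx t.sock buf ~len >|= fun () -> Ok ()

(* Receive: hand every frame that arrives to [callback]. *)
let listen t ~header_size:_ callback =
  let rec loop () =
    Xdp.recv t.sock >>= fun frame ->
    callback frame >>= loop
  in
  loop ()
```

The rest of the signature (mac, mtu, stats, disconnect) is mostly bookkeeping around the socket.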
Most of the systems are pretty straightforward in this regard: they just use Bigarray to wrap externally allocated memory. The cstruct library does this for the Mirage ecosystem. Whether or not you need zero allocation is up to the use case.
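For example, a Cstruct is just a view over a Bigarray, so wrapping a region of externally allocated memory involves no copy. A minimal sketch (here a freshly created Bigarray stands in for an externally allocated region like an AF_XDP umem):

```ocaml
(* A Cstruct is a window onto a Bigarray, so wrapping external
   memory costs no copy.  A plain Bigarray stands in here for an
   externally allocated region. *)
let () =
  let umem =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout 4096
  in
  (* Wrap the whole region, then take a zero-copy slice of one frame. *)
  let whole = Cstruct.of_bigarray umem in
  let frame = Cstruct.sub whole 0 2048 in
  Cstruct.set_uint8 frame 0 0xff;
  (* The write is visible through the underlying Bigarray. *)
  assert (Bigarray.Array1.get umem 0 = '\xff')
```

Since slices share the underlying buffer, protocol layers can carve headers and payloads out of one received frame without copying.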
It’s been 11 years since I last got Ensemble working with ocaml, and I know that back then, I had to make some fixes. Unfortunately, the code is … lost to the mists of time, as I did that when I worked for IBM. I’ll see if I can get the code working again, but it might be a little while (gotta get the knowledge off tape backup from the vault under Yucca Mountain, it’s been so long ;-).
Are you aware of “rsocket”? It’s a C/C++ layer on top of ibverbs that gives you something that is nearly indistinguishable from TCP (but pretty close to the performance of Infiniband R(eliable)C(onnected)). It might be worth asking around on the ibverbs and rsocket mailing lists whether there’s been any work to port the ibverbs APIs to AF_XDP. It would mean writing more code to implement the RC layer, but I have to believe that the ibverbs folks aren’t going to just leave that possible target for ibverbs untapped.
The relevance would be that, if somebody ports ibverbs to AF_XDP, then via rsocket you could get high-performance nearly-TCP sockets “for free” – just lift 'em up into OCaml via FFI.
ETA: Clarification: rsocket is (nearly-)API-compatible, but not wireline-compatible. So it’s not as if you can have a standard TCP socket on one end, and an rsocket on the other.
This sounded like a neat project. I’ve been messing around a bit with creating a netif-xsk module here. It’s very much a work in progress, but I’m at the point of trying to hook it up to a direct stack. I’m not quite sure where to start with building a direct stack around a netif module. Do you have an idea of the best examples or documentation to get started with that?
At this point, you will have a main.ml file that contains all the functor applications required to hook up the Netif through to the TCPIP stack (going via Ethif, ARP, IP, etc).
You can just cut and paste from main.ml and remove some of the autogenerated keys (which Mirage uses to configure things) and just hardcode those into a testcase for your xdp layer.
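For orientation, the functor applications in a generated main.ml look roughly like the following. The module names come from mirage-tcpip, but the exact functor shapes vary between releases, so treat this as a sketch rather than something to paste verbatim:

```ocaml
(* Approximate shape of the functor chain a generated main.ml builds,
   from the netif up to the direct TCP/IP stack.  Netif_xsk is your
   device; the other functor arities are release-dependent. *)
module Eth = Ethernet.Make (Netif_xsk)
module Arp1 = Arp.Make (Eth) (OS.Time)
module Ip =
  Static_ipv4.Make (Mirage_random_stdlib) (Mclock) (Eth) (Arp1)
module Icmp = Icmpv4.Make (Ip)
module Udp1 = Udp.Make (Ip) (Mirage_random_stdlib)
module Tcp1 =
  Tcp.Flow.Make (Ip) (OS.Time) (Mclock) (Mirage_random_stdlib)
module Stack =
  Tcpip_stack_direct.Make (OS.Time) (Mirage_random_stdlib) (Netif_xsk)
    (Eth) (Arp1) (Ip) (Icmp) (Udp1) (Tcp1)
```

The generated file then calls `connect` on each layer in order, threading the results into `Stack.connect`; that is the part you can hardcode for a standalone test of the xdp layer.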
Once you’re happy with it, we can add direct configuration support into the mirage tool (so it would sit alongside mirage-net-unix as a mirage-net-xdp at https://github.com/mirage/mirage/blob/master/lib/mirage/mirage_impl_network.ml#L15). This will let mirage configure --net=xdp just work for anyone building Mirage apps on Linux. This might also be a useful option to integrate into Solo5 directly, so that we don’t have to use tap devices to get data frames.
I’m taking another crack at this. Using your suggestions I’ve been able to assemble a TCP/IP stack on top of my netif device. Thanks for the pointers!
I want to make sure I understand the semantics of the Netif.listen function. In the template main.ml that’s generated from mirage configure --net=direct there are several invocations of the Netif.listen function. For instance, one from calling Tcpip_stack_direct.connect and another from calling Mirage_stack.listen. Should each invocation of Netif.listen yield an identical stream of packets that arrive at the network device? Or are the packets passed to the correct parts of the stack no matter which invocation of listen grabs the packets from the network device?