Using OCaml to write network interface drivers

Does OCaml still have LLVM bindings off the shelf?
They could be used to boost speed through custom code generation in JIT mode.

It does: see the corresponding opam package.

As an aside, you might be interested/amused to read about the Ensemble network protocol system, written by Mark Hayden at Cornell back in 1997. Back then the native-code compiler was … experimental (IIRC) and he still achieved impressive results. There’s a chapter on the techniques he used, and it’s quite educational – managing memory (not leaving it up to the GC) was very important, and he did a lot of work to make it tractable.

The protocols he implemented were “full virtual synchrony”, which is far more involved than TCP, and his code was fully competitive with an existing C implementation.

Reference for the curious: Mark Hayden’s PhD thesis, https://ecommons.cornell.edu/handle/1813/7316, Chapter 4.

A very good and interesting read. Thank you @Chet_Murthy.

Given Chet’s claim that it was competitive with an existing C implementation, I was wondering whether it was another one along the lines of the meme “I can beat C with (insert your favourite functional PL) if I restrict myself to this tiny dialect that looks nothing like functional programming!”. Not at all! These are all techniques that you have already read about on this Discuss. Summary:

  1. Get good latency by allocating very little and avoiding promotion in the fast path.
  2. To avoid allocating, use inlining to eliminate the closures allocated for higher-order iteration, etc.
  3. Also avoid allocation by preallocating closures outside the fast path.
  4. Use inlining to remove the cost of abstractions as well.
  5. Allocate buffers outside of the OCaml heap in arenas (in C) and manage them manually with reference counting.
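Items 2 and 3 can be illustrated with a small sketch. This is not code from Ensemble or the thesis; the function names (`sum_hof`, `sum_loop`, `sum_prealloc`, `words_allocated`) are made up for illustration, and whether the first version really allocates depends on the compiler version and inlining flags (flambda, for instance, may remove the closure).

```ocaml
(* Higher-order version: the closure captures [acc], so unless the
   compiler inlines [Bytes.iter], a closure is allocated on every call. *)
let sum_hof (b : Bytes.t) : int =
  let acc = ref 0 in
  Bytes.iter (fun c -> acc := !acc + Char.code c) b;
  !acc

(* Hand-inlined version: a plain loop, no closure to allocate. *)
let sum_loop (b : Bytes.t) : int =
  let acc = ref 0 in
  for i = 0 to Bytes.length b - 1 do
    acc := !acc + Char.code (Bytes.get b i)
  done;
  !acc

(* Preallocated version: the accumulator and the callback are set up
   once, outside the fast path, and reused by every call.
   Assumption: single-threaded, non-reentrant use. *)
let total = ref 0
let step c = total := !total + Char.code c
let sum_prealloc (b : Bytes.t) : int =
  total := 0;
  Bytes.iter step b;
  !total

(* Rough measurement helper: minor-heap words allocated while running [f].
   The measurement itself may allocate a few words. *)
let words_allocated (f : unit -> int) : float =
  let before = Gc.minor_words () in
  ignore (f ());
  Gc.minor_words () -. before
```

Comparing `words_allocated (fun () -> sum_hof b)` with the other two variants on your own compiler is a quick way to check whether a given fast path allocates.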

The last technique, manual buffer management, is less common. Messages have similar lifetimes, so an arena is rarely kept alive by a single message. With zero-copy message handling, he reports that managing buffers with the GC causes fragmentation issues and 25% of time spent in the GC for moderate workloads. The figure shows that the benefits of zero-copy are negated for medium-sized messages. Only manual reference counting keeps latency low whatever the message size. @kayceesrk mentioned a similar problem to me some time ago, but I was missing a good reference until now.
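For readers who want a feel for what manual reference counting of buffers looks like, here is a minimal sketch. It is an assumption-laden stand-in, not Hayden's design: he allocated arenas in C, whereas this uses a fixed pool of `Bigarray` buffers (whose data lives outside the OCaml heap). The names `create_pool`, `acquire`, `retain` and `release` are hypothetical, and the pool is not thread-safe.

```ocaml
open Bigarray

type buf = (char, int8_unsigned_elt, c_layout) Array1.t

(* One pooled buffer plus its manual reference count. *)
type slot = { data : buf; mutable refs : int }

type pool = {
  slots : slot array;   (* all buffers, allocated once at startup *)
  free : int array;     (* stack of indices of free slots *)
  mutable top : int;    (* number of entries currently on the stack *)
}

let create_pool ~count ~size : pool =
  { slots =
      Array.init count (fun _ ->
          { data = Array1.create char c_layout size; refs = 0 });
    free = Array.init count (fun i -> i);
    top = count }

(* Take a buffer from the pool; returns its index, or -1 if the pool
   is exhausted. Returning an int avoids allocating an option. *)
let acquire (p : pool) : int =
  if p.top = 0 then (-1)
  else begin
    p.top <- p.top - 1;
    let i = p.free.(p.top) in
    p.slots.(i).refs <- 1;
    i
  end

(* Another holder (e.g. a second protocol layer) keeps the buffer alive. *)
let retain (p : pool) (i : int) : unit =
  let s = p.slots.(i) in
  s.refs <- s.refs + 1

(* Drop one reference; the buffer returns to the pool when none remain. *)
let release (p : pool) (i : int) : unit =
  let s = p.slots.(i) in
  s.refs <- s.refs - 1;
  if s.refs = 0 then begin
    p.free.(p.top) <- i;
    p.top <- p.top + 1
  end
```

None of `acquire`, `retain` or `release` allocates on the OCaml heap, so the GC never has to trace or move message data; the price is that a forgotten `release` leaks a buffer and an extra one corrupts the free stack.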

The discussion on the suitability of ML is thorough and well-written.

Hayden mentions that several improvements to OCaml were made as a response to his requests.

Yes, #5 was the real killer. Two further thoughts:

(1) Mark went on to rewrite Ensemble in C, and got similar performance. He observed that without having written it in ML first, it would have been much, much, much harder to write C-Ensemble.

(2) This issue of “explicitly manage buffers” is a really important one.

For decades (at least since 1986, when I first noticed), GC jocks have been promising us (in LISP, Scheme, Standard ML, Smalltalk, Java) that the NEXT GREAT ALGORITHM would make it unnecessary to explicitly manage memory (in GCed languages). Literally they’ve been doing this dance for >30 years. And it has NEVER worked out. [But hope springs eternal … NEXT time …] Why? [rank conjecture here] Machines get faster, memories get bigger, cache sizes don’t keep up, TLBs don’t keep up. Maybe other things. It’s true that as machines get faster, the class of applications you can solve without “working really hard” grows. But for ultimate performance, the recipe hasn’t changed, and it sure includes managing the memory lifetime of your most important data structures.

BTW, this isn’t confined to GC & GCed languages. For example, it’s been the case that if you’re really going to do serious transaction processing, and you care about latency and performance, you’re going to use what amounts to Infiniband (aka “RoCEE”). The (Linux) kernel jocks may be doing incredible things in the TCP stack, but if you really care, you’re going to use a network card that allows you to bypass the kernel entirely, pin physical memory, do everything using polling, and avoid copies wherever possible. And this has been true since Infiniband was invented … for the transaction-processing hardware and software built by Tandem in the mid-80s.
