Using Ocaml to write network interface drivers


#1

At 35c3 this year I learned about a project on implementing user space network drivers in different programming languages: ixy languages by Paul Emmerich and his students at Technische Universität München. The implementations include one written in Ocaml by Fabian Bonk https://github.com/ixy-languages/ixy.ml.

The topic is very interesting by itself. What’s more, it provides a very interesting performance comparison of programming languages in high-throughput use cases. Preliminary results for some of the languages are available on github: https://github.com/ixy-languages/ixy-languages.

Ocaml seems to perform on par with Haskell, but significantly worse than other GCed languages such as C# and Go.

Besides being a very interesting project, it would be great to know if anyone with a deeper understanding of Ocaml and its internals has some throughts on where the performance differences come from?


#2

That’s a really good initiative from the TU-Munchen group!

At a quick look, it’s probably just bytes bounds checking in the Cstruct layer. Just needs someone to profile the OCaml one and see where it’s spending its time. We have, at various points in the MirageOS lifecycle, gotten extremely competitive performance due to app<->kernel specialisation causing work to be saved (e.g. via memoisation).


#3

This is very interesting. I don’t know what our expectations should be on the packet rate, but I find the latency spike in the latency benchmark disappointing, and I would hope that it can be improved.

Unfortunately, I don’t have access to a machine with this hardware (the server-type machine that I have at hand either have Broadcom network cards, or Inter 82579, which I suppose is an older generation card), and it does not seem possible to run any of their benchmark without the hardware.


#4

Well, the first thing that rings the bell is that they are using Int32 for arithmetics. It’s a boxed type, so it should be extremely slow. Using Int63, or Zarith would be much more efficient.

Of course, you need to profile before making any decisions :slight_smile:

Also, allocating a bigarray is very slow in OCaml, and a well-written driver shall not allocate at all. To my experience, when I was writing such kind of drivers I was getting the same performance as a C program would get. But, of course, no allocations in the loop.


#5

Yeah that doesn’t look good. Another thing that profiling might surface is that network device drivers typically perform better on multicore when they can run rx and tx in parallel on separate cores. That OCaml uses a single CPU lock on the memory collector might be relevant to its comparison with other GC languages that do not. Not at all sure why the Haskell implementation is so slow, but without looking at it my first guess would be to look for whether it uses explicit strictness appropriately.


#6

Hello, author here!

Thanks for the interest!

I’m painfully aware of the slow int32 arithmetics and I’ve eliminated it in most places but unfortunately the NIC requires register accesses to be done 32 bit at a time. Creating an unboxed int32 type available only on 64-bit machines seems like a solution, but that way we couldn’t use the %caml_bigstring_{g,s}et32 primitives which might lead to optimization issues?

Bigarray allocations are only ever done during mempool creation which shouldn’t be a problem. I do allocate a new array for each batch of received packets though.

I’ve done some initial profiling with perf. I don’t actually know much about profiling so I just sort of looked through the hotspots and changed things around until it got faster. I haven’t looked into other profiling tools like spacetime for example.

Regarding the latency spike: I think that has something to do with major heap collections. I’ve observed some “oscillating” behavior, i.e. repeated slowdowns and speedups with a period of a few seconds that seem to coincide with major heap collections. I haven’t looked into this further; I just figured I’d mention it.

I thought about storing the packet metadata in the hugepage like the C driver (see my explanation) to avoid an indirection, though that doesn’t seem very idiomatic to me. The whole API is already very un-OCaml-y (manual memory management) and I really don’t want to make it any worse.

I’d appreciate any ideas or suggestions for improvement you may have! I’m also looking for resources on OCaml profiling.


#7

So, what’s the bottleneck at the moment?


#8

Reading the descriptor status bits took (and still takes) a lot of time. These two commits gave me ~20% better performance:



I need to check every descriptor’s DD (descriptor done) bit to see if the NIC is done writing packets to memory. For receive descriptors I need to additionally check that the EOP (end of packet) bit is set because I’ve not (yet) implemented jumbo frame handling. I’ve written documentation on exactly this.

I’ve experimented a bit with compiler explorer and noticed that the compiler seems to generate a lot of code for this fairly simple operation. I’m not sure if there’s a more efficient way of checking these bits.

I have not yet profiled the latest commit.


#9

Oh my, OCaml is not very good in arithmetic, especially in such bit-tinkering. Though if I understand correctly your binding, the vast majority of instruction concerns array bound checking or get16 stuff. Maybe you need to use something less safe as avsm had suggested.

Ah, since you already use the Cstruct, it should do it for you. The rest of the listing seems quite concise, are you sure that’s the only bottleneck?


#10

There are definitely more bottlenecks; those are just the most obvious ones since they should be very cheap operations. I think the array allocation in rx_batch is also quite expensive. Not sure how to avoid that though.

I really want to avoid writing more C code (my driver is the only one that still requires any C :frowning_face:). What would be the most idiomatic way to avoid bounds checking? Unsafe operations are sort of frowned upon in our project.


#11

Well, cstruct is already using get_unsafe for you. You also can check the bounds manually and then use get_unsafe functions.

I think the array allocation in rx_batch is also quite expensive. Not sure how to avoid that though.

I would start with looking at the gprof gist, it’s really pointless to say about measuring. I think you could do pretty much staff you do in C, Rust or C# code in OCaml as well, for example you could keep your packages as a list of Bigarray sub-buffers and copy them on request, use async io with Lwt/Async, etc. But first it would be productive to measure what causes the latency (sometimes the problem is so unexpected that you wouldn’t find regardless the profound knowledge of the technology). Maybe you could post a gprof or perf results here.


#12

I’m painfully aware of the slow int32 arithmetics and I’ve eliminated it in most places but unfortunately the NIC requires register accesses to be done 32 bit at a time. Creating an unboxed int32 type available only on 64-bit machines seems like a solution, but that way we couldn’t use the %caml_bigstring_{g,s}et32 primitives which might lead to optimization issues?

First of all, you can always emulate 2^32 modulus arithmetics using 2^63 modulus arithmetics, something like this:

let m = 1 lsl 32 - 1
let add32 x y = (x + y) land m [@@inline] 

This will be inlined even in the vanilla OCaml compiler and generate efficient code, with m precomputed. Even more, if you are not expecting any overflows (i.e., any values that are greater than 2^32-1, then you can omit the modulo operation, and just use plain OCaml ints).

You can use caml_bigstring_{g,s}et16u to access 32-bit values, by doing two reads and concatenation, e.g., for little endian:

let get32le data i = 
   let lo = get16 data (i+1) and hi = get16 data i in
   hi lsl 16 lor lo
[@@inline]

the same for put


let m16 =  1 lsl 16 - 1

let put32le data i x =
   put16 data i (x land m16);
   put16 data (i+1) (x lsr 16)
[@@inline]

Not using `"%caml_bigstring_get32u" is a boon, not a bane, as this primitive allocates memory and triggers the whole GC machinery so not only it creates garbage that will take time to collect, but it will also create a couple of basic blocks that will stall your CPU and stress the branch predictor.

Another tip, is that if you ever translate a bit value to bool via the translation from int to bool you should use Obj.magic for that, e.g.,

let unsafe_bool_of_int : int -> bool = Obj.magic
let int_of_bool : bool -> int = Obj.magic

The reason is that otherwise each bit-to-bool translation will be compiled something like this

if RAX == 3:
   RAX := 3
else
   RAX := 1
return RAX

So, your bit-twiddling will be translated into lots of branches, which is very inefficient.
Of course, even more efficient would be forfeiting any translation from bool to int and
relying on bitwise arithmetics.

To summarize, there are two killers of performance:

  1. allocations - performance critical code shall not allocate
  2. branching - the more you can do without relying on if/then/else the better

Also, do not forget to use the [@@inline] and [@@unboxed] attributes as well as consider trying flambda, it does magic :slight_smile: (sometimes it doesn’t, but this is the nature of magic).

Another tip is to to use different -d options of the compiler to control the generated code. I personally just prefer -S option and control it via the assembler to be absolutely sure.

Once you get rid of any allocations in the hotspots your performance would be on par with any C or Rust implementations.

Yeah that doesn’t look good. Another thing that profiling might surface is that network device drivers typically perform better on multicore when they can run rx and tx in parallel on separate cores. That OCaml uses a single CPU lock on the memory collector might be relevant to its comparison with other GC languages that do not.

Well, OCaml has the master lock only on its heap, all operations on the non-controlled memory could be done without any lock, so you can safely copy rx/tx buffers in OCaml in parallel. Of course, if your code allocates, then GC will stop the world, but this is another story. A driver shall not allocate. Even in C. Even with malloc.

Disclaimer: code snippets are not tested and provided for illustration purposes only. No guarantee of fitness or even type correctness :slight_smile:


#13

I replaced int32 with two Cstruct.uint16 = int in a few spots, though I could still change the TXD.reset function’s internals. I still need to do the register reads and writes 32 bit at a time; it says so somewhere in the datasheet iirc. The register file is not backed by memory but actually generates PCIe transactions on reads and writes. To be fair I never tried doing two 16 bit accesses there, so it might work anyway.

You can use caml_bigstring_{g,s}et16u to access 32-bit values

I’m guessing the u is for unsafe? Again, unsafe operations are sort of frowned upon. We’re trying to write idiomatic code, not tinker with the language internals. To me it seems like the compiler (especially at -O3) should figure out that bit-to-bool translation by itself, no?

I’m already using flambda; using the standard compiler the code is ~30% slower iirc.

I will play around with the [@@inline] and [@@unboxed] attributes. I’m not sure where I could use the [@@unboxed] one, though. I don’t think I have any single-field record types or single-constructor, single-argument types and I’m never calling C functions in the main rx/tx functions.

What are the -d options you mentioned? -dparsetree and so on? They don’t seem to be in the man page.

I’m curios, how would you implement the rx_batch function without allocations?

val rx_batch : ?batch_size:int -> t -> int -> Memory.pkt_buf array
(** [rx_batch dev queue] receives packets from [dev]'s queue [queue].
    Returns between [0] and [num_rx_queue_entries] packets.
    If [batch_size] is specified then between [0] and [batch_size] packets
    will be returned. *)

At some point packet buffers have to be passed to the user and I don’t think going the C route (having the user preallocate an array and then just filling in the array in the driver) is very OCaml-y.

How would I go about implementing non-heap operations in parallel? Would be cool to have that working, but I fear this won’t do much for the benchmark results, since we’re limiting all languages to a single core at 3.3 GHz or even 1.6 GHz.


#14

Well, cstruct is already using get_unsafe for you.

Is it? Are the %caml_bigstring_{g,s}et32 primitives unsafe?

Edit: turns out they are

I would start with looking at the gprof gist

Will do! Any resources along the lines of “gprof+OCaml for dummies”?

Maybe you could post a gprof or perf results here.

This might sound dumb, but what exactly should I collect and post? How much load should I generate? More than what the driver can handle? The raw perf dump is pretty huge; how should I process it?


#15

Here u stands for unsigned. Concerning their safety, it is a very different story. Any compiler primitive is an internal tool and is not for the casual user. First of all they are untyped, and of course, no bound checks are done. But later on it.

No, even with flambda and O3.

It is useful when you want to have an extra type safety without sacrificing performance,

type modulus = Modulus of int [@@unboxed]

Will have the same representation as int but it would be hard for you to confuse int with modulus, for example.

Yep, but not the -dparsetree, I was talking about -dcmm, -dlinear and lots of -d from flambda could be very insightful. (the -dparstree option outputs the source code as it is parsed, we’re interested in the code that is generated). See the output of ocamlopt -help for more options.

At some point packet buffers have to be passed to the user and I don’t think going the C route (having the user preallocate an array and then just filling in the array in the driver) is very OCaml-y.

I would have a pool of abstract iovec data structures, which will be bigarrays underneath the hood. They all will be preallocated during the driver initialization time, based on drivers parameters. The same as Linux kernel is doing, basically.

This will work out of the box, if you will do bigarray copying/blitting the primitives that will do this will release the runtime. So you can use plain OCaml threads which maps to native threads and copy in parallel as many buffers as you like. Of course, you will need to implement some synchronization, as usual.

Safety in OCaml

A feature is unsafe if it may break type system or bypass array bounds check.

  1. any usage of Obj.magic is unsafe and should be reserved for standard library implementers.
  2. any usage of a primitive is even more unsafe, and even more fragile.

So if not sure, you should rely on some library that provides high-level constructs that are based on those primitives. (Basically, it should be either OCaml standard library or Janestreet Base/Core).

What is OCamlish

Let’s not confuse low-level programming with unsafe features or non-idiomatic style. OCaml is an excellent language for low-level and system programming. It suits excellent into real-time and HPC environment. It requires, however, some understanding of how it works, and careful design of the interface. The same as with other languages, different contexts imply different styles. That’s true even for C. So low-level programming is not non-idiomatic for OCaml, see for example this book, as well as lots of examples from the Mirage team. With all that said, it is non-idiomatic to apply the high-level style to low-level code. So no arrays, strings, allocations, records, and other high-level sugar in the low-level code. You can still use the full power of OCaml, especially relying heavily on abstract data types to protect your invariants, and on GADT to make typesafe flags and enumerations.

Speaking on GADT, I’ve seen somewhere in your code that you’re using variants with payloads. You can easily substitute it with GADT, without a payload, which won’t require any allocations, e.g.,

type 'a opt  = 
  | Debug : unit opt
  | TTL : int opt
  | Threshold : float opt

let setopt (type t) (opt : t opt) (arg:t) = match opt with
    | Debug -> printf "enabled debugging"
    | TTL -> printf "set TTL to %d" arg
    | Threshold -> printf "set threshold to %g" arg;;

To summarize, an idiomatic OCaml code could be efficient, but you have to change your style. And that is not to say, that you’re no longer coding OCaml, in fact, you should be even more OCamlish than ever, and use more types, more modules, more abstractions, because you have more invariants to protect.


#16

Wow, excellent post, thanks a lot!

no bound checks are done

I assumed the primitives actually did bounds checking; good thing I went through cstruct.

No, even with flambda and O3.

Why not if I may ask? Is there a rule of thumb to predict optimization behavior?

See the output of ocamlopt -help for more options.

Any reason they’re not in the man page?

I would have a pool of abstract iovec data structures, which will be bigarrays underneath the hood

How do I allocate the iovecs to the user, though? Ideally without the necessity for manual memory management. I already have mempools that maintain a stack of preallocated packet buffers.

I started reading this book a while back but never actually read past the first chapter. I will take another look though!

I actually initially implemented the register type with GADTs, but never got it to work without Obj.magic and at some point just gave up and went back to regular variants. Do feel free to point out my mistakes though, I’m curios what I did wrong. I asked for suggestions on reddit a while back, but never implemented any of the solutions that were suggested. Shame on me.


#17

Is it? Are the %caml_bigstring_{g,s}et32 primitives unsafe?

Yeah, it does the bound checking, what a pity.

Will do! Any resources along the lines of “gprof+OCaml for dummies”?

I’m not sure if OCaml has any peculiarities here, if your driver works in userspace, you should be able to attach perf [1] for example to your process and check which symbols consume much time. OCaml name mangaling is very trivial, and your project structure is flat (no complex module hierarchies), so symbol names would be obvious.

This might sound dumb, but what exactly should I collect and post? How much load should I generate? More than what the driver can handle? The raw perf dump is pretty huge; how should I process it?

I think the list of the most time-consuming symbols would suffice for a start. I don’t thing the load matters here much since on your charts OCaml looks consistently slower that C#, C and Rust.

Also @ivg had given an incredible and comprehensive summary on what you can do with OCaml and which pitfalls you should avoid. I would like to add a couple of examples of more or less fast OCaml code.

As it was mentioned, you would like to avoid allocations as much as possible, so you would like to check JaneStreet libraries, since they had succeeded in avoiding them. They use [@@inline] and [@@unboxed] attribs heavily, but also a fancy metaprogramming stuff like [2] (although I’m not sure if it’s still relevant). Mirage tcpip stack could also be the source of inspiration [3].

[1] http://www.brendangregg.com/perf.html

https://fedoramagazine.org/performance-profiling-perf/

[2] https://github.com/janestreet/ppx_optional

[3] https://github.com/mirage/mirage-tcpip/tree/master/src/tcp


#18

At first glance, it looks like rx_batch and tx_batch are used in the App.fwd (anywhere else?) and it’s not at all clear to me that array is the best container for this purpose. I bet you would notice a speed improvement by using list instead. The reason is that you’re paying a high allocation price for the mutable random-access array of colocated elements in the major heap, and you’re only using the container features that would be just as fast with a single-linked list of elements drawn from the much faster minor heap. If it were me, I’d also look into using Stdlib.Seq.t instead of list but that would defeat the “batch” nature of the functions.


#19

Oh wow, I just implemented a naive list based version and it’s now ~0.5 Mpps faster. Thank you very much, I’ll commit the new version in a few days.

Edit: ~0.5 per direction that is, so 1 Mpps in total.


#20

Well, it is not that trivial as it looks, especially since OCaml is trying not to do magic stuff and ad-hoc optimizations. But I think I’ve jumped over a few intermediate steps. I will explain below.

OCaml, even with flambda, is a very predictable compiler. It basically follows the code that you’re writing. It doesn’t try to be clever and to guess what you’re trying to do, instead it implements as efficiently as possible whatever you have written. For example, suppose you would like to write a function int_of_bool, especially since there is no such function in the standard library. So you will probably write something like this:

let int_of_bool x = if x then 1 else 0

And OCaml will fairly translate this into a branching instruction. It won’t perform any fancy optimizations like branch elimination, hoisting, cse elimination, or anything like this. It will, however, generate code that will represent high-level functional programming concepts into a very efficient implementation, i.e., it will uncurry functions, try to remove many indirect calls, compile pattern matching to efficient binary trees, perform cross-module optimization, and, of course, will apply expression optimizations (constant folding) and a fair amount of inlining. Optimizations are applied in the backend where no type information is available, so the compiler no longer knows that the function int_of_bool has type bool -> int , nor that they share the representation, so nothing is left here.

With Flambda the story is basically the same, except that the flambda will apply optimizations more aggressively, and the optimizations are applied in the middle-end in the new representation called FLambda which is much richer that enables more optimizations.

That’s all not to say, that OCaml compiler is bad, not optimizing or anything like this. It generates excellent and straightforward code, with no stupid jokes (I’m saying this is a guy who spends most of his life looking into the binary code :smiley:). OCaml is good, but it will fairly translate bad code into bad binaries.

Concerning the int_of_bool problem, it could be easily resolved on the standard library level. It just never came to the table, so nobody looked into it. I’m not sure how much it affects your code or any production code. But if you will find then you have this operation in a tight-loop, you should know that you can always get away of it, since it is a no-op underneath the hood. And it is not necessary to use Obj.magic, you can, for example, create your own bool type, e.g.,

module Bit : sig 
    type t = private int
    val one : t
    val zero : t 
    val of_int : int -> t
    val to_int : t -> t
    val is_set : t -> bool
    (* ... etc ... *)
end = struct 
    type t = int
    let one = 1
    let zero = 0
    let of_int x = x <> 0 [@@inline]
    let to_int x = x [@@inline]
    let is_set x = x = 1 [@@inline]  
end

So that the dd function can now return Bit.t, which is an int underneath the hood (and compiler knows that it has the int representation), but its invariant (that it is either one or zero is protected by the module system).

Because they left undocumented, as they are considered a debugging tool for compiler developers, and the representation is also undocumented. But should it stop us? Forbidden fruit tastes the sweetest :slight_smile:

Wow, that’s actually very wrong :slight_smile: In fact, your use of Obj.magic here is totally inappropriate as it breaks type system and will lead to segmentation faults. For example, in this function, you’re downcasting cast 'a reg to int reg, but imagine what will happen if someone will pass int32 reg there? Another problem with your code, is that you’re not using abstractions properly, type t = int doesn’t create a new type, it introduces a type alias, which basically another name (or type constructor) for the same type. So all your registers are indexed with the same type, all having type int reg. This will introduce a new type.

type t  = Fctrl of int [@@unboxed]

This will also introduce a new type

module Fctrl : sig 
   type t [@@unboxed]
end = struct 
   type t = int
   let bam = 0x00000400 (* Broadcast Accept Mode *)
end

and this will also create a new type,

module Fctrl : sig 
   type t = private int
end = struct 
   type t = int
   let bam = 0x00000400 (* Broadcast Accept Mode *)
end

Next, the whole idea of using GADT instead of plain ADT is that you can pass payloads independently of the constructor. Basically, instead of doing set (TTL 1000) which will allocate a boxed value (TTL 1000) then pass it to set, which will extract it, and throw away, you can do set TTL 1000 which will call a function with two immediate arguments, and the TTL type will prescribe the type of the second argument.

So this is wrong (well, in our context):

type 'a reg = 
  | X : x -> x reg
  | Y : y -> y reg

This is what you want:

type 'a reg = 
  | X : x reg
  | Y : y reg

Now, you have N different registers each indexed with the type of a datum that you can store in it. From the terms of OOP it means that you have N different classes sharing the same base. So when you writing the set method, you should dispatch it over all possibilities, e.g.,


module Status : sig 
   type t = private int
   val s1 : t
   val s2 : t
   val s3 : t
   val s4 : t
end = struct 
   let s1 = 0b01
   let s2 = 0b10
   let s3 = 0b11
   let s4 = 0b00
end

type 'a reg = 
  | Status : Status.t reg
  | AL : int reg
  | AH : int reg 

let set : type t. t reg -> t -> unit = fun reg arg -> match reg with
  | StatusRegister -> 
   (* in this branch `t` is refined to `Status.t` *)
   set_status_register arg 
  | AL -> (* here `t` is refined to a completely different type `int` *) 
    set_al arg
  | AH -> 
   (* here it is also `int` but in modern OCaml 
      you can't yet unify AL and AH branch *)
    set_ah arg

Well, this is a user responsibility in general to provide the input buffer, so I’m not sure why on this problem occurs. Consider the readv, writev in posix, they let the user decide on the allocation Policy. So your interface should basically have the same interface - it should accept user data, along with a descriptor, and the data representation should be protected using abstractions, if possible.