Thank you @elizabethlvova for this link.
Hi @hannes, and other MirageOS developers if they know the answer. I am curious about the soft real-time applications.
- What kind of latency do you target, and what kind of latency does OCaml allow you to achieve? Are there concrete evaluations of this in the context of MirageOS? (Bonus internet points if they are public so they can be referenced in a paper, that would be very helpful to me!)
- I have learnt on this Discourse that low latency can be obtained in OCaml by writing in a special style where you promote very little. Do you sometimes have to pay attention to your allocation patterns when you program for MirageOS? Have you ever had to profile an application for latency, and fix it by changing allocation patterns?
I am not sure whether there’s anyone focussing on low-latency MirageOS unikernels. My goal is to first get robust and sustainable infrastructure. I have been playing a bit with an old version of statmemprof to figure out allocation profiles (and landmarks for profiling code), but I am not aware of any in-depth allocation analysis. The closest I am aware of is httpaf’s motivational benchmarks https://github.com/inhabitedtype/httpaf#performance. There is also https://github.com/mirage/mirage/pull/968 with respect to our IP stack, but I still feel there’s room for improvement (such as using String/Bytes instead of Bigarray, and avoiding allocation of small structures when sending data).
It is interesting you mentioned this. Isn’t the usage of bigarray more efficient than String/bytes? I think httpaf uses bigstringaf and faraday which seems to pervasively use bigarray as its primary buffer data structure. Isn’t this a performant choice?
Compared to strings/bytes, Bigarray doesn’t have much of an advantage. Allocation is always relatively expensive, and accessing it requires going through the C API, whereas small strings/bytes are allocated on the minor heap cheaply. The main advantage is bypassing the size limit of strings (and arrays) on 32-bit platforms, which is a few megabytes.
Bigarray really starts to shine compared to OCaml arrays, which are scanned by the GC element by element.
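To make the trade-off concrete, here is a small illustrative sketch (the function names are mine, not from the thread) contrasting the two allocation paths for many small buffers:

```ocaml
(* Illustrative sketch: many small buffers as Bytes (cheap minor-heap
   allocations) vs. as bigarrays (each one a malloc in the C heap). *)

let small_bytes n =
  (* Bytes.create 16 is a fast minor-heap allocation. *)
  Array.init n (fun _ -> Bytes.create 16)

let small_bigarrays n =
  (* Each create goes through malloc; the result is never moved by the GC. *)
  Array.init n (fun _ ->
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout 16)

let () =
  let b = small_bytes 10_000 and a = small_bigarrays 10_000 in
  assert (Bytes.length b.(0) = 16);
  assert (Bigarray.Array1.dim a.(0) = 16)
```

Timing these two loops (e.g. with `Unix.gettimeofday`) is a quick way to see the minor-heap advantage for small sizes.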
The main interest of bigarrays is that they are allocated using malloc in the C heap and so are not moved by the GC. This means they can be passed to C functions safely. On the other hand, malloc was not designed to sustain an allocation regime like that of OCaml (which allocates small objects at a fast rate). When you use bigarrays for small structures which are allocated at a high rate, that can lead to fragmentation problems and high memory usage (precisely because they cannot be moved by the GC!).
Oh, that’s definitely something I’d like to figure out for a whole system - now that we have a firewall, a DNS server (storing zonefiles in a git repository), a TLS reverse proxy, a CalDAV server, an OpenVPN gateway, a static site webserver (data in a git repo) - let them run with bigarray as is, and at the same time run a version using String/Bytes and compare the CPU and memory characteristics.
I guess the allocation strategy “one big struct” (see https://github.com/haesbaert/awa-ssh/blob/master/lib/dbuf.ml) vs “lots of small byte vectors” (see https://github.com/mirleft/ocaml-tls/blob/master/lib/writer.ml) should be evaluated appropriately as well. If we’re talking about networked services, my suspicion is that if the network device allocates MTU-sized buffers and the upper layers fill them with data, the least fragmentation occurs. For safety we have an implementation using phantom types for read-only / write-only (which we do not use widely yet) https://github.com/mirage/ocaml-cstruct/pull/237.
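A minimal sketch of the two strategies (hypothetical helpers in the spirit of dbuf.ml, not the actual awa-ssh or ocaml-tls code):

```ocaml
(* "One big struct": a single growing buffer that writers append into. *)
type dbuf = { mutable buf : Bytes.t; mutable off : int }

let create () = { buf = Bytes.create 64; off = 0 }

let ensure t n =
  (* Grow geometrically when the next write would not fit. *)
  if t.off + n > Bytes.length t.buf then begin
    let buf = Bytes.create (2 * (t.off + n)) in
    Bytes.blit t.buf 0 buf 0 t.off;
    t.buf <- buf
  end

let append_string t s =
  ensure t (String.length s);
  Bytes.blit_string s 0 t.buf t.off (String.length s);
  t.off <- t.off + String.length s

let contents t = Bytes.sub_string t.buf 0 t.off

(* "Lots of small vectors": build fragments, concatenate once at the end. *)
let of_fragments frags = String.concat "" frags

let () =
  let d = create () in
  append_string d "GET ";
  append_string d "/ HTTP/1.1";
  assert (contents d = of_fragments [ "GET "; "/ HTTP/1.1" ])
```

The first strategy allocates rarely but copies on growth; the second allocates one small minor-heap object per fragment plus one final copy.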
TL;DR: so much to do, so little time.
It is much clearer now. I somehow got the impression that using Bigarray as a buffer data structure was more efficient than stdlib Buffer data structure. It seems it is not clear cut for now. Thanks for your responses.
I would just like to add one pro of bigarray: because a bigarray cannot move in the heap, we have the ability to release the runtime lock for some computations, such as hash algorithms.
About MirageOS, we currently mostly use cstruct, which has another difference from bigarray: the underlying record. Such a design is more efficient when we do a sub operation, as @ivg said here: Working with a huge data chunks.

However, the question of which to choose (Bytes.t or Bigstring.t) is a bit hard and it really depends on your context - and, as @xavierleroy said:
Mirage people don’t seem to care, as they allocate small bigarrays like crazy.
This is a good question, and it’s helpful to understand what each datastructure is backed by, and what operations are inefficient.
Bigarray is a pointer to externally allocated memory of arbitrary length. It supports creating smaller views of the same memory without copying it, which is implemented at the OCaml runtime level. Accessing data within bigarrays is fast thanks to some compiler primitives which allow for endian-neutral parsing and serialisation, implemented by ocplib-endian.
- Bigarrays are extremely convenient for network IO, since they support everything needed for minimal copying of data from the OS. You can exchange memory pages directly from the OS into the OCaml heap, and process them. Unfortunately, one operation is critically slow here – creating a substring. Bigarray’s provenance was originally to interop better with Fortran-style HPC code, where the size and dimensionality of arrays is generally large. For IO, we just want really speedy 1-dimensional arrays, and in this usecase Bigarray substring creation is very slow due to the underlying reference counting. Thus cstruct was born, which keeps a single underlying Bigarray structure and allocates small OCaml records on the minor heap for subviews. These are cheap to create and GC, and the underlying data is not copied unless requested.
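The cstruct idea can be sketched in a few lines. This is a simplified model of the design, not the real `Cstruct.t` definition:

```ocaml
(* Sketch of the idea behind cstruct: one bigarray, many cheap views. *)
type ba =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type view = { buffer : ba; off : int; len : int }

let of_bigarray buffer =
  { buffer; off = 0; len = Bigarray.Array1.dim buffer }

let sub t off len =
  (* Only a three-field record is allocated on the minor heap;
     no runtime-level Bigarray proxy is created, no data is copied. *)
  if off < 0 || len < 0 || off + len > t.len then invalid_arg "sub";
  { t with off = t.off + off; len }

let get t i =
  if i < 0 || i >= t.len then invalid_arg "get";
  Bigarray.Array1.get t.buffer (t.off + i)

let () =
  let ba = Bigarray.Array1.create Bigarray.char Bigarray.c_layout 8 in
  for i = 0 to 7 do
    Bigarray.Array1.set ba i (Char.chr (Char.code 'a' + i))
  done;
  let v = sub (of_bigarray ba) 2 3 in
  assert (get v 0 = 'c' && v.len = 3)
```

Because `sub` only allocates a tiny record, a parser can take thousands of subviews of one received page without touching the data.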
Strings are immutable and sit in the OCaml heap, and require a data copy from the outside world into them. Under some circumstances (usually small allocations) they can be more performant.
Buffers are a resizable String, and efficient if you need to concatenate lots of data of unknown size.
So the final answer, as with many systems performance problems about what is “efficient” depends on your allocation patterns. For transmitting data, there is often a number of small pieces of data that are combined onto a set of pages for the write path. In this case, a hybrid of “in-heap” assembly using small strings followed by blitting into a Bigarray is reasonable. For reading, parsing directly from a Bigarray into a cstruct works well.
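The hybrid write path described above can be sketched as follows; the names and the HTTP payload are illustrative only:

```ocaml
(* Sketch: assemble small pieces in-heap with Buffer, then blit once
   into a page-sized bigarray destined for the device. *)
let fill_page () =
  let hdr = Buffer.create 64 in
  Buffer.add_string hdr "HTTP/1.1 200 OK\r\n";
  Buffer.add_string hdr "Content-Length: 2\r\n\r\n";
  Buffer.add_string hdr "ok";
  let s = Buffer.contents hdr in
  let page = Bigarray.Array1.create Bigarray.char Bigarray.c_layout 4096 in
  (* A single copy from the OCaml heap into the externally allocated page. *)
  String.iteri (fun i c -> Bigarray.Array1.set page i c) s;
  (page, String.length s)

let () =
  let page, len = fill_page () in
  assert (Bigarray.Array1.get page 0 = 'H');
  assert (len = 41)
```

All the small intermediate allocations stay on the fast minor heap, and only one blit crosses into the malloc'd bigarray.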
This is basically all incorrect; please see above. Accessing Bigarrays can be done via builtin compiler primitives that make it fast. And the point of using them is to avoid multiple small allocations, especially on the read path.
The basic approach to low latency OCaml hasn’t really changed much in the last few decades. You just need to minimise allocation to maximise GC throughput, and OCaml makes it fairly easy to write that sort of low level code. Two papers that might be helpful:
- “Melange: Towards a functional internet”, EuroSys 2007. Contains a latency analysis of an SSH and DNS server vs C equivalents, and some techniques on writing low-latency protocol parsers. These days, we do roughly the same thing with ppx’s and cstructs, without the DSL in the way.
- “Jitsu: Just-in-Time Summoning of Unikernels”, NSDI 2015. This shows the benefits of whole-system latency control – you can mask latency by doing some operations concurrently, which is easy to do in unikernels and hard in a conventional OS.
We’ve never really built systems in the “soft realtime” sense so far – for example, no video transmission system or isochronous Bluetooth implementations. Internet protocols are very resilient to variable latency, although of course we want to keep things as low as possible. I’ve been looking into multipath multicast video transmission in Mirage recently due to the current work-at-home situation, so that might change soon depending on how it goes.
One thing that has changed in the past decade is the latency profile of the OCaml GC, which has steadily improved thanks to @damiendoligez’s work. That has let us get away with not directly addressing latency much in Mirage itself, as every upgrade of the compiler is a pleasant improvement.
And indeed, @xavierleroy is right that we allocate like crazy, with the caveat that this only really happens on the transmission path of most protocols. Reads tend to go through a more minimal copy discipline.
We certainly do care about this, but it has to be fixed upstream in OCaml as we have reached the limits of what we can practically do with Bigarray – I am hoping that multicore OCaml is the perfect time to unify all these IO approaches in that direction as part of that effort. Mirage will benefit from whatever happens there eventually.
Sorry if it is obvious, but for systems that allocate many small bigarrays have the performance experiments considered various malloc implementations? This situation seems like the sort that really distinguishes the likes of jemalloc and tcmalloc.
Yeah, we could bundle one of those implementations easily into the next iteration of nolibc, where we currently just use dlmalloc for simplicity. It just wasn’t a bottleneck when I last tested it, since (for example) we are doing a ton of CPU work in the TLS stack while allocating lots of small substrings. The allocator efficiency is lost in the noise with the cryptography load. That might change in the future as we do more offload to hardware, of course, and implement more streaming protocols.
Thanks a lot for the detailed answers, and all the pointers, this looks very interesting!
Once a bigarray is allocated, how does one “free” it? What happens to the memory allocated to a bigarray after the OCaml program terminates? Or does one not have to worry about freeing/deallocating it, as seems to be the case when using cstruct? The OCaml manual does not seem clear on this aspect: https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.html
Below is the scenario of a simple network program doing IO using bigarray,
1. Read 10 bytes from socket into bigarray/cstruct - called b1.
2. Parse from 'b1' BE/LE int8, int16, int32 as required.
3. Do other tasks.
4. Exit program.
Is my understanding below correct?
In step 1 we do not create a copy of 10 bytes from OS/socket into bigarray.
Cstruct.t helps in step 2 by not allocating intermediate buffers when parsing bytes into int8, int16, int32 etc.
If the above is correct, then indeed for network IO cstruct seems to be the correct library/technique for implementing network protocols.
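Step 2 can be sketched with the stdlib Bytes accessors (available since OCaml 4.08); `Cstruct.BE` / `Cstruct.LE` provide the same kind of fixed-offset accessors over a bigarray-backed view, likewise without allocating intermediate buffers. The values written here are made up for illustration:

```ocaml
(* Parsing fixed-size big-endian integers at known offsets in a
   10-byte buffer, with no intermediate allocations. *)
let () =
  let b1 = Bytes.create 10 in
  (* Pretend these 10 bytes just arrived from the socket. *)
  Bytes.set_uint8 b1 0 0x2a;
  Bytes.set_uint16_be b1 1 0x0102;
  Bytes.set_int32_be b1 3 0x01020304l;
  (* Step 2: read the fields back out directly. *)
  assert (Bytes.get_uint8 b1 0 = 0x2a);
  assert (Bytes.get_uint16_be b1 1 = 0x0102);
  assert (Bytes.get_int32_be b1 3 = 0x01020304l)
```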