Thank you @elizabethlvova for this link.
Hi @hannes, and other MirageOS developers if they know the answer. I am curious about the soft real-time applications.
- What kind of latency do you target, and what kind of latency does OCaml allow you to achieve? Are there concrete evaluations of this in the context of MirageOS? (Bonus internet points if they are public so they can be referenced in a paper, that would be very helpful to me!)
- I have learnt on this Discourse that low latency can be obtained in OCaml by writing in a special style where you promote very little. Do you sometimes have to pay attention to your allocation patterns when you program for MirageOS? Have you ever had to profile an application for latency, and fix it by changing allocation patterns?
I am not sure whether there’s anyone focussing on low-latency MirageOS unikernels. My goal is to first get robust and sustainable infrastructure. I have been playing a bit with an old version of statmemprof to figure out allocation profiles (and landmarks for profiling code), but I am not aware of any in-depth allocation analysis. The closest I am aware of is httpaf’s motivational benchmarks https://github.com/inhabitedtype/httpaf#performance. There is also https://github.com/mirage/mirage/pull/968 with respect to our IP stack, but I still feel there’s room for improvement (such as using String/Bytes instead of Bigarray, and avoiding allocation of small structures when sending data).
It is interesting you mentioned this. Isn’t the usage of bigarray more efficient than String/bytes? I think httpaf uses bigstringaf and faraday which seems to pervasively use bigarray as its primary buffer data structure. Isn’t this a performant choice?
Compared to strings/bytes, Bigarray doesn’t have much of an advantage. Allocation is always relatively expensive, and accessing it requires going through the C API, whereas small strings/bytes are allocated on the minor heap cheaply. The main advantage is bypassing the size limit of strings (and arrays) on 32-bit platforms, which is a few megabytes.
Bigarray really starts to shine compared to OCaml arrays, which are scanned by the GC element by element.
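To make the trade-off concrete, here is a small illustrative sketch (the function names are mine, not from the thread) contrasting the two allocation paths for many small buffers:

```ocaml
(* Illustrative sketch: many small buffers as Bytes (cheap minor-heap
   allocations) vs. as bigarrays (each one a malloc in the C heap). *)

let small_bytes n =
  (* Bytes.create 16 is a fast minor-heap allocation. *)
  Array.init n (fun _ -> Bytes.create 16)

let small_bigarrays n =
  (* Each create goes through malloc; the result is never moved by the GC. *)
  Array.init n (fun _ ->
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout 16)

let () =
  let b = small_bytes 10_000 and a = small_bigarrays 10_000 in
  assert (Bytes.length b.(0) = 16);
  assert (Bigarray.Array1.dim a.(0) = 16)
```

Timing these two loops (e.g. with `Unix.gettimeofday`) is a quick way to see the minor-heap advantage for small sizes.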
The main interest of bigarrays is that they are allocated using malloc in the C heap and so are not moved by the GC. This means they can be passed to C functions safely. On the other hand, malloc was not designed to sustain an allocation regime like that of OCaml (which allocates small objects at a fast rate). When you use bigarrays for small structures which are allocated at a high rate, that can lead to fragmentation problems and high memory usage (precisely because they cannot be moved by the GC!).
Oh, that’s definitely something I’d like to figure out for a whole system - now that we have a firewall, a DNS server (storing zonefiles in a git repository), a TLS reverse proxy, a CalDAV server, an OpenVPN gateway, a static site webserver (data in a git repo) - let them run with bigarray as is, and at the same time run a version using String/Bytes and compare the CPU and memory characteristics.
I guess the allocation strategy “one big struct” (see https://github.com/haesbaert/awa-ssh/blob/master/lib/dbuf.ml) vs “lots of small byte vectors” (see https://github.com/mirleft/ocaml-tls/blob/master/lib/writer.ml) should be evaluated appropriately as well. If we’re talking about networked services, my suspicion is that if the network device allocates MTU-sized buffers and the upper layers fill them with data, the least fragmentation occurs. For safety we have an implementation using phantom types for read-only / write-only (which we do not use widely yet) https://github.com/mirage/ocaml-cstruct/pull/237.
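A minimal sketch of the two strategies (hypothetical helpers in the spirit of dbuf.ml, not the actual awa-ssh or ocaml-tls code):

```ocaml
(* "One big struct": a single growing buffer that writers append into. *)
type dbuf = { mutable buf : Bytes.t; mutable off : int }

let create () = { buf = Bytes.create 64; off = 0 }

let ensure t n =
  (* Grow geometrically when the next write would not fit. *)
  if t.off + n > Bytes.length t.buf then begin
    let buf = Bytes.create (2 * (t.off + n)) in
    Bytes.blit t.buf 0 buf 0 t.off;
    t.buf <- buf
  end

let append_string t s =
  ensure t (String.length s);
  Bytes.blit_string s 0 t.buf t.off (String.length s);
  t.off <- t.off + String.length s

let contents t = Bytes.sub_string t.buf 0 t.off

(* "Lots of small vectors": build fragments, concatenate once at the end. *)
let of_fragments frags = String.concat "" frags

let () =
  let d = create () in
  append_string d "GET ";
  append_string d "/ HTTP/1.1";
  assert (contents d = of_fragments [ "GET "; "/ HTTP/1.1" ])
```

The first strategy allocates rarely but copies on growth; the second allocates one small minor-heap object per fragment plus one final copy.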
TL;DR: so much to do, so little time.
It is much clearer now. I somehow got the impression that using Bigarray as a buffer data structure was more efficient than stdlib Buffer data structure. It seems it is not clear cut for now. Thanks for your responses.
I would just like to add one pro of bigarray: because a bigarray cannot move in the heap, we have the ability to release the runtime lock for some computations, such as hash algorithms.
About MirageOS, we currently mostly use cstruct, which has another difference from bigarray: the underlying record. Such a design is more efficient when we do a sub operation, as @ivg said here: Working with a huge data chunks.

However, the question of which to choose (Bytes.t or Bigstring.t) is a bit hard and it really depends on your context - and, as @xavierleroy said:
Mirage people don’t seem to care, as they allocate small bigarrays like crazy.
This is a good question, and it’s helpful to understand what each datastructure is backed by, and what operations are inefficient.
Bigarray is a pointer to externally allocated memory of arbitrary length. It supports creating smaller views of the same memory without copying it, which is implemented at the OCaml runtime level. Accessing data within bigarrays is fast thanks to some compiler primitives which allow for endian-neutral parsing and serialisation, implemented by ocplib-endian.
- Bigarrays are extremely convenient for network IO, since they support everything needed for minimal copying of data from the OS. You can exchange memory pages directly from the OS into the OCaml heap, and process them. Unfortunately, one operation is critically slow here – creating a substring. Bigarray’s provenance was originally to interop better with Fortran-style HPC code, where the size and dimensionality of arrays is generally large. For IO, we just want really speedy 1-dimensional arrays, and in this usecase Bigarray substring creation is very slow due to the underlying reference counting. Thus cstruct was born, which keeps a single underlying Bigarray structure and allocates small OCaml records on the minor heap for subviews. These are cheap to create and GC, and the underlying data is not copied unless requested.
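The cstruct idea can be sketched in a few lines. This is a simplified model of the design, not the real `Cstruct.t` definition:

```ocaml
(* Sketch of the idea behind cstruct: one bigarray, many cheap views. *)
type ba =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type view = { buffer : ba; off : int; len : int }

let of_bigarray buffer =
  { buffer; off = 0; len = Bigarray.Array1.dim buffer }

let sub t off len =
  (* Only a three-field record is allocated on the minor heap;
     no runtime-level Bigarray proxy is created, no data is copied. *)
  if off < 0 || len < 0 || off + len > t.len then invalid_arg "sub";
  { t with off = t.off + off; len }

let get t i =
  if i < 0 || i >= t.len then invalid_arg "get";
  Bigarray.Array1.get t.buffer (t.off + i)

let () =
  let ba = Bigarray.Array1.create Bigarray.char Bigarray.c_layout 8 in
  for i = 0 to 7 do
    Bigarray.Array1.set ba i (Char.chr (Char.code 'a' + i))
  done;
  let v = sub (of_bigarray ba) 2 3 in
  assert (get v 0 = 'c' && v.len = 3)
```

Because `sub` only allocates a tiny record, a parser can take thousands of subviews of one received page without touching the data.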
Strings are immutable and sit in the OCaml heap, and require a data copy from the outside world into them. Under some circumstances (usually small allocations) they can be more performant.
Buffers are a resizable String, and efficient if you need to concatenate lots of data of unknown size.
So the final answer, as with many systems performance problems about what is “efficient” depends on your allocation patterns. For transmitting data, there is often a number of small pieces of data that are combined onto a set of pages for the write path. In this case, a hybrid of “in-heap” assembly using small strings followed by blitting into a Bigarray is reasonable. For reading, parsing directly from a Bigarray into a cstruct works well.
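The hybrid write path described above can be sketched as follows; the names and the HTTP payload are illustrative only:

```ocaml
(* Sketch: assemble small pieces in-heap with Buffer, then blit once
   into a page-sized bigarray destined for the device. *)
let fill_page () =
  let hdr = Buffer.create 64 in
  Buffer.add_string hdr "HTTP/1.1 200 OK\r\n";
  Buffer.add_string hdr "Content-Length: 2\r\n\r\n";
  Buffer.add_string hdr "ok";
  let s = Buffer.contents hdr in
  let page = Bigarray.Array1.create Bigarray.char Bigarray.c_layout 4096 in
  (* A single copy from the OCaml heap into the externally allocated page. *)
  String.iteri (fun i c -> Bigarray.Array1.set page i c) s;
  (page, String.length s)

let () =
  let page, len = fill_page () in
  assert (Bigarray.Array1.get page 0 = 'H');
  assert (len = 41)
```

All the small intermediate allocations stay on the fast minor heap, and only one blit crosses into the malloc'd bigarray.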
This is basically all incorrect; please see above. Accessing Bigarrays can be done via builtin compiler primitives that make it fast. And the point of using them is to avoid multiple small allocations, especially on the read path.
The basic approach to low latency OCaml hasn’t really changed much in the last few decades. You just need to minimise allocation to maximise GC throughput, and OCaml makes it fairly easy to write that sort of low level code. Two papers that might be helpful:
- “Melange: Towards a functional internet”, EuroSys 2007. Contains a latency analysis of an SSH and DNS server vs C equivalents, and some techniques on writing low-latency protocol parsers. These days, we do roughly the same thing with ppx’s and cstructs, without the DSL in the way.
- “Jitsu: Just-in-Time Summoning of Unikernels”, NSDI 2015. This shows the benefits of whole-system latency control – you can mask latency by doing some operations concurrently, which is easy to do in unikernels and hard in a conventional OS.
We’ve never really built systems in the “soft realtime” sense so far – for example, no video transmission system or isochronous Bluetooth implementations. Internet protocols are very resilient to variable latency, although of course we want to keep things as low as possible. I’ve been looking into multipath multicast video transmission in Mirage recently due to the current work-at-home situation, so that might change soon depending on how it goes.
One thing that has changed in the past decade is the latency profile of the OCaml GC, which has steadily improved thanks to @damiendoligez’s work. That has let us get away with not directly addressing latency much in Mirage itself, as every upgrade of the compiler is a pleasant improvement.
And indeed, @xavierleroy is right that we allocate like crazy, with the caveat that this only really happens on the transmission path of most protocols. Reads tend to go through a more minimal copy discipline.
We certainly do care about this, but it has to be fixed upstream in OCaml as we have reached the limits of what we can practically do with Bigarray – I am hoping that multicore OCaml is the perfect time to unify all these IO approaches in that direction as part of that effort. Mirage will benefit from whatever happens there eventually.
Sorry if it is obvious, but for systems that allocate many small bigarrays have the performance experiments considered various malloc implementations? This situation seems like the sort that really distinguishes the likes of jemalloc and tcmalloc.
Yeah, we could bundle one of those implementations easily into the next iteration of nolibc, where we currently just use dlmalloc for simplicity. It just wasn’t a bottleneck when I last tested it, since (for example) we are doing a ton of CPU work in the TLS stack while allocating lots of small substrings. The allocator efficiency is lost in the noise with the cryptography load. That might change in the future as we do more offload to hardware, of course, and implement more streaming protocols.
Thanks a lot for the detailed answers, and all the pointers, this looks very interesting!
Once a bigarray is allocated, how does one “free” it? What happens to the memory allocated to a bigarray after the OCaml program terminates? Or does one not have to worry about freeing/deallocating it, as seems to be the case when using cstruct? The OCaml manual does not seem clear on this aspect: https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.html
Below is the scenario of a simple network program doing IO using bigarray,
1. Read 10 bytes from socket into bigarray/cstruct - called b1.
2. Parse from 'b1' BE/LE int8, int16, int32 as required.
3. Do other tasks.
4. Exit program.
Is my understanding below correct?
In step 1 we do not create a copy of 10 bytes from OS/socket into bigarray.
Cstruct.t helps in step 2 by not allocating intermediate buffers when parsing bytes into int8, int16, int32 etc.
If the above is correct, then indeed for network IO cstruct seems to be the correct library/technique for implementing network protocols.
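Step 2 can be sketched with the stdlib Bytes accessors (available since OCaml 4.08); `Cstruct.BE` / `Cstruct.LE` provide the same kind of fixed-offset accessors over a bigarray-backed view, likewise without allocating intermediate buffers. The values written here are made up for illustration:

```ocaml
(* Parsing fixed-size big-endian integers at known offsets in a
   10-byte buffer, with no intermediate allocations. *)
let () =
  let b1 = Bytes.create 10 in
  (* Pretend these 10 bytes just arrived from the socket. *)
  Bytes.set_uint8 b1 0 0x2a;
  Bytes.set_uint16_be b1 1 0x0102;
  Bytes.set_int32_be b1 3 0x01020304l;
  (* Step 2: read the fields back out directly. *)
  assert (Bytes.get_uint8 b1 0 = 0x2a);
  assert (Bytes.get_uint16_be b1 1 = 0x0102);
  assert (Bytes.get_int32_be b1 3 = 0x01020304l)
```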