Diagnosing large amounts of time spent in GC

I’m trying to optimise capnp-rpc. When I check it with perf record etc., roughly 50% of its running time turns out to be spent doing garbage collection.

First off, is this normal for this kind of application? One thought I had was that since each request has to traverse the loopback interface, it will persist for a while, and thus may cause pathologically bad behavior for the GC?

If that’s the case, is there any way to reduce this kind of behavior, or are there tuning parameters to minimise the impact?

Hi Chris,

50% of time in GC is high. One thing worth pointing out, though, is that perf profiles based on CPU time rather than wall-clock time. That means that if the code you’re optimising is doing plenty of IO with only a small amount of other processing, the results you see will skew heavily towards the runtime machinery.

What does the output of time give you when you run your benchmark? That will show the total elapsed time as well as the CPU time.

Next up, if it turns out it’s mostly CPU time, then you likely want to establish what in the GC is consuming time. Which GC functions are taking up the most time, based on the perf report?

I’d also dump out the GC statistics using OCAMLRUNPARAM=v=0x400 - that will tell you how many words were allocated and how many minor and major collections were run.
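If you’d rather capture the same counters from inside the program (for example only around the benchmark itself), the standard Gc module can print them directly; a minimal sketch:

(* Print the GC counters when the program exits, as an in-process
   alternative to OCAMLRUNPARAM=v=0x400. *)
let () =
  at_exit (fun () -> Gc.print_stat stderr)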

Is this benchmark available somewhere?

2 Likes

I assume your test-case is the benchmark you contributed at https://github.com/mirage/capnp-rpc/blob/master/test-bin/echo/echo_bench.ml (thanks!).

A couple of quick suggestions:

  1. The benchmark makes 100,000 requests in parallel. It might be more realistic to limit the number of concurrent requests. e.g. Lwt_stream.of_list ops |> Lwt_stream.iter_n ~max_concurrency:12 (fun v -> v ()) (there is a fuller sketch after this list).
  2. The default message buffer size in capnp is 8k, which is larger than you need for ping messages. It might be more efficient to use a smaller default (e.g. 100 bytes). There are various places in the code where it would be good to specify better defaults, or expose the default size in the API (for user messages).
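For suggestion 1, here is a minimal sketch of the concurrency-limited client loop; send_ping is just a stand-in for whatever issues a single RPC in the benchmark:

(* Sketch: run 100_000 pings with at most 12 in flight at a time,
   instead of starting them all at once with Lwt_list.iter_p.
   send_ping : unit -> unit Lwt.t is a placeholder for one request. *)
let run_client ~send_ping =
  let ops = List.init 100_000 (fun _ () -> send_ping ()) in
  Lwt_stream.of_list ops
  |> Lwt_stream.iter_n ~max_concurrency:12 (fun op -> op ())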

There probably is too much allocation going on though, so I’d suggest profiling memory allocations. You might be able to do that using spacetime, or you could try the new memory profiler in OCaml 4.10 (the API for that is hidden by default but it is possible to use it anyway - e.g. see https://github.com/ocurrent/ocurrent/pull/127).

2 Likes

It’s the one talex mentioned.

So I initially thought it might have been as you mentioned (re high IO); however, it is spending the vast majority of its time on the CPU:

real    0m7.707s
user    0m5.708s
sys     0m1.616s

so only about 0.4s of the elapsed time is spent off the CPU (presumably waiting on IO).

From the flamegraph it looks like the time is going into check_urgent_gc and caml_call_gc, both of which subsequently call mark_slice and sweep_slice.

Then the output from the gc statistics is:

allocated_words: 931728970
minor_words:     483004749
promoted_words:  52996558
major_words:     501720779
minor_collections: 2031
major_collections: 20
heap_words:      357150208
heap_chunks:     54
top_heap_words:  357150208
compactions: 0

I was trying to see where the buffer size was set. There is a 4k buffer for reads; however, I couldn’t see a buffer for writes at the capnp-rpc-net/endpoint.ml level or below (after lunch I’m working my way up from that…).

Right, so it does not look like an IO problem.

From the looks of the GC statistics, most of your allocations are going straight into the major heap: major_words minus promoted_words is roughly 450M of the ~930M words allocated, so about half of your allocation bypassed the minor heap entirely. This happens for large allocations (over 256 words). Probably worth using some memory profiling to work out where this is happening.
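As a quick way to see that threshold in action, here is a small standalone check (nothing capnp-specific, just the Gc counters):

(* Demonstration of the direct-to-major-heap threshold: allocations larger
   than 256 words (2 KiB on 64-bit) skip the minor heap entirely. *)
let major_words_for f =
  let before = (Gc.quick_stat ()).Gc.major_words in
  ignore (Sys.opaque_identity (f ()));
  (Gc.quick_stat ()).Gc.major_words -. before

let () =
  Printf.printf "64-byte Bytes: %.0f words allocated on the major heap\n"
    (major_words_for (fun () -> Bytes.create 64));
  Printf.printf "8 KiB Bytes:   %.0f words allocated on the major heap\n"
    (major_words_for (fun () -> Bytes.create 8192))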

Your bottlenecks from perf (mark and sweep) also indicate time is going in major heap work.

2 Likes

Since this is an RPC application, it might be possible to pool/reuse buffers. That has been extremely effective in the past in many contexts for reducing GC overhead. I would point you at Mark Hayden’s PhD thesis (Ensemble, Cornell, 1997) but also at the enormous work done in Java over the years to pool buffers for this very reason.
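For illustration only, a free-list style pool over fixed-size Bytes buffers could be as small as the sketch below (Buffer_pool, alloc and release are names made up for this example; a real pool also needs a size cap and a clear ownership rule for release):

(* Minimal buffer-pool sketch: keep a free list of fixed-size Bytes buffers
   and reuse them instead of allocating a fresh one per message.
   Illustrative only; the caller must not touch a buffer after release. *)
module Buffer_pool = struct
  type t = { size : int; mutable free : Bytes.t list }

  let create ~size = { size; free = [] }

  (* Take a buffer from the pool, or allocate a new one if the pool is empty. *)
  let alloc t =
    match t.free with
    | buf :: rest -> t.free <- rest; buf
    | [] -> Bytes.create t.size

  (* Return a buffer to the pool so a later alloc can reuse it. *)
  let release t buf =
    if Bytes.length buf = t.size then t.free <- buf :: t.free
end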

4 Likes

So I had a quick look at this last night and it seems all of the allocations going directly to the major heap are coming from this:

e4d3ef caml_alloc_shr (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
e4dc4c caml_alloc_string (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
dbc48b camlStdlib__bytes__make_92 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
bdf537 camlCapnp__Message__create_1378 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
9fe762 camlCapnp_rpc_lwt__Rpc_schema__init_root_17995 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
b88acc camlCapnp_rpc_lwt__Request__create_5179 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
76db5c camlDune__exe__Echo__ping_3552 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
76d632 camlDune__exe__Echo_bench__fun_1054 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
da3ec2 camlLwt__apply_2167 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
db869a camlStdlib__list__rmap_f_253 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
dacba9 camlLwt_list__iter_p_168 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
76d534 camlDune__exe__Echo_bench__run_client_138 (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
76d9ed camlDune__exe__Echo_bench__entry (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)
769fc8 caml_program (capnp-rpc/test-bin/echo/_build/default/echo_bench.exe)

Is there a way of caching or reusing those messages internally?

I think it would be possible for this benchmark; however, in the general case each of the messages would be independent. That being said, it may be possible to maintain a pool of buffers, thus avoiding the GC penalty…

Indeed, I meant pooling the underlying buffers themselves.

As @Chet_Murthy mentioned earlier, many high performance frameworks make aggressive use of buffer pooling. I suggest having a look at Netty’s buffer pool implementation.

So I had a quick look at this last night and it seems all of the allocations going directly to the major heap are coming from this [Capnp.Message.create].

Interesting! I guess if the messages were allocated using a smaller size then they’d go on the minor heap instead, and if the benchmark sent fewer at once then they might never need to be promoted to the major heap.

I strongly recommend reading chapter 4 of Hayden’s PhD thesis: http://www.nuprl.org/documents/Hayden/ensemblesystem.pdf

I will bet that there’s a robust literature on memory-pooling in Java applications, but I don’t really know where I’d look. Especially for a protocol like capnproto, where the entire idea is to not allocate, I think doing something to manage large flat buffers would be “in keeping with the spirit of the system”.

Also, re “allocate smaller blocks”: that really doesn’t help as much as you might think. You’re doing a ton of RPCs in parallel, so it’s likely you’ll fill up the minor heap, and then you’ll be stuck again.

It’s been true forever in GCed systems (and, heck, in non-GCed systems) that a significant opportunity for performance improvement lies in “allocation avoidance”. If you avoid allocating memory, you win. It’s still true that too much of that (e.g. by pooling too many different kinds of memory) is also bad (Zorn et al. wrote a seminal paper on this IIRC), but still, for the critical data structures in a system, it’s a valuable approach.

2 Likes

It would be nice to have a generic buffer pool library in OCaml for instances such as these.

1 Like

TL;DR I think that a good start for such a library would be to carve out Ensemble’s “iovec” library. But most people would see that as going too far, I fear.

It is difficult to implement such a thing for -any- language, because all the hard parts involve the interaction with the “host system”. That is, how will the pool library recognize that a particular buffer is no longer in use and can be recycled? That’s really the whole problem. If we assume there’s a definite entrypoint (e.g. “release()”) that signals this, then sure, it’s easy to create such a generic library. And so, for Java/J2EE, where much of the pooling happens in layers of “I/O streams/readers/writers”, that’s possible, because those have well-defined lifetimes (associated with the request/response lifecycle). In this case (capnproto) that’ll be true also, for similar reasons, except that it will require caveats to users that they must not retain pointers to incoming messages, etc.
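In OCaml/Lwt terms, that “definite entrypoint” could be a scoped combinator that guarantees the release, along the lines of this sketch (with_buffer is invented here, and it assumes a pool with alloc/release like the earlier Buffer_pool sketch):

(* Scoped use: the buffer is released when f finishes, whether it succeeds
   or fails, which gives the "definite release point" discussed above. *)
let with_buffer pool f =
  let buf = Buffer_pool.alloc pool in
  Lwt.finalize
    (fun () -> f buf)
    (fun () -> Buffer_pool.release pool buf; Lwt.return_unit)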

Also, if you’re really concerned about this sort of performance, you don’t want those buffers to be in the major heap, either, yes? You’d want them outside the heap completely. Going further, if you’re -really- concerned about performance, e.g. let’s say you’re working with an Infiniband/RoCEE card, you’re going to want to pin those buffers to physical memory and register them with the card and associated kernel library. More complexity.
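For the “outside the heap completely” part, the usual OCaml tool is Bigarray: the payload is malloc’d outside the OCaml heap, and the GC only ever sees a small proxy object. A minimal sketch:

(* An off-heap buffer: the 8 KiB of data lives outside the OCaml heap,
   so the GC never scans or moves it; only the small proxy is tracked. *)
let off_heap_buffer : (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t =
  Bigarray.Array1.create Bigarray.char Bigarray.c_layout 8192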

I did the above with Ensemble’s “iovec” library, and it was quite lovely. Thing is, it’s way overkill for most applications, which just don’t need microsecond-level performance.

1 Like

I made a PR to test the effect of limiting concurrency and reducing the buffer sizes: https://github.com/mirage/capnp-rpc/pull/200

The results on my machine were:

Original:

echo_bench.exe: [INFO] rate = 23975.514051
echo_bench.exe: [INFO] rate = 23774.676839
echo_bench.exe: [INFO] rate = 23807.165267

Limit concurrency:

echo_bench.exe: [INFO] rate = 28369.413333
echo_bench.exe: [INFO] rate = 28242.881411
echo_bench.exe: [INFO] rate = 28636.375119

And reduce message buffer sizes:

echo_bench.exe: [INFO] rate = 40065.965541
echo_bench.exe: [INFO] rate = 40835.360431
echo_bench.exe: [INFO] rate = 40146.473277

So, about 1.7x faster.

3 Likes

There’s a “learning” from transaction processing which might be useful to you. It is that the graph of throughput as a function of presented concurrency (so: “throughput” on the Y axis, “presented concurrency” on the X axis) starts off as a straight line, then flattens out, and then -drops-. This is referred to as the “knee in the curve”. Finding that knee is useful, and can tell you things about your system.

It might be useful to build an automated benchmarker that runs benchmarks with varying parameters, collecting results so you can build such graphs. I certainly found it very, very useful in debugging Java runtime and web-app-server performance, back in the day.
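A sketch of what such a sweep might look like, assuming a hypothetical run_bench : concurrency:int -> float that performs one benchmark run and returns the measured request rate:

(* Sweep the presented concurrency and record throughput, to locate the
   "knee in the curve". run_bench is a stand-in for one benchmark run. *)
let sweep ~run_bench =
  List.iter
    (fun concurrency ->
      let rate = run_bench ~concurrency in
      Printf.printf "concurrency=%4d  rate=%.1f req/s\n%!" concurrency rate)
    [ 1; 2; 4; 8; 12; 16; 32; 64; 128; 256 ]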

ETA: One other thing: when you find that “knee”, if the drop-off is severe, it always means there’s a bug somewhere. A typical such bug is lack of “admission control”: as early as possible in a TP system, there should be a throttle on admitting transactions, so that the internal concurrency of the system never exceeds some known maximum level. It is never useful in a TP system to admit new work when in-flight work already consumes all of one or more critical resources.

2 Likes