Performance regression

Hi,
I am working on a compiler written in OCaml. We noticed a significant drop in the performance of our software after 4.06.1.

All else being equal, a run of our test suite takes the following times, depending on the version of the compiler used to compile the same source code:

4.06.1: 77.34s
4.07.0: 382.27s
4.08.1: 363.48s

Is it a known issue? Are there things to look out for to avoid getting such a drop in performance?

Yes. See:

Producing a minimized sample and profiling the compiler can help the maintainers address the particular regressions.

I will try to come up with a minimal example, but it is somewhat difficult since it is a full-on commercial project.

The second link talks about the performance of the compiler itself rather than that of the code it generates. Therefore, I don’t think it relates to my problem.

The first link references a performance drop with flambda, but I was not using that optimization when I performed the test. From this discussion I would expect the performance regression to be seen exclusively when using flambda. Am I misunderstanding something here?

You are indeed not affected by the two linked issues, and this performance drop is really unexpected. I am not sure what change between 4.06.1 and 4.07 could affect you.

Indeed. It would be really great to get a minimized example of the regression. It seems concerning, especially given the magnitude.

I’m afraid we cannot say much without seeing the code of the projects involved and maybe testing it. Could you give a pointer to your codebase?

(Otherwise the generic advice applies: have you tried profiling your code, have new hotspots shown up?)

Edit: an indication of what is typically the performance bottleneck in your code (computations? I/O? memory usage?) could also help in guessing what the regression might be.

I am working on getting a minimal example. The one potentially helpful thing I have found so far is that the problem relates to our Qt bindings, more specifically to our file handling through these bindings and the associated memory management/allocation/garbage collection.

We boiled it down to a combination of two things: garbage collection and the allocation of custom blocks.

Our project is CPU-intensive, and we found that expanding the minor_heap_size of the GC gave us a performance benefit of about 10-20% at the expense of increased memory use. To do that, we have a line of code that expands the minor heap size when the executable starts.
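(For reference: an equivalent way to get the same effect without touching the code is the runtime’s OCAMLRUNPARAM environment variable, whose s option sets the minor heap size in words. The value below is only an example, not our actual setting.)

OCAMLRUNPARAM=s=8M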

On the other side of the project, we use custom blocks to interact with C++ objects. Initially we created the custom blocks as follows:


obj = caml_alloc_custom(&custom_block, sizeof(Object*), sizeof(Object), 10000);
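
For context, the &custom_block argument above is a custom_operations descriptor. A minimal sketch of what it roughly looks like is below; the identifier string and the finalize_object/object_delete names are illustrative, not the project’s actual names:

/* Sketch of the custom_operations behind &custom_block above. */
#include <caml/mlvalues.h>
#include <caml/custom.h>

typedef struct Object Object;        /* opaque handle to the C++ object */
void object_delete(Object *o);       /* assumed C wrapper around `delete` */

#define Object_val(v) (*(Object **) Data_custom_val(v))

/* Called by the GC when the block becomes unreachable. */
static void finalize_object(value v)
{
  Object *o = Object_val(v);
  if (o != NULL) object_delete(o);
}

static struct custom_operations custom_block = {
  "org.example.object",              /* identifier string, illustrative */
  finalize_object,
  custom_compare_default,
  custom_hash_default,
  custom_serialize_default,
  custom_deserialize_default,
  custom_compare_ext_default
};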

With those values for the used and max parameters, we found that the custom blocks were not being finalized as often as we needed. So we changed the max parameter to 1:


obj = caml_alloc_custom(&custom_block, sizeof(Object*), sizeof(Object), 1);

That worked just OK, but the objects were still not being finalized as fast as we needed, so we changed our code to explicitly finalize every custom block. This worked perfectly, but we kept the max parameter set to 1.
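The explicit finalization we ended up with looks roughly like the sketch below (continuing the sketch above, in the same stub file); object_alloc and object_release are illustrative names and the details are simplified:

#include <caml/memory.h>

/* Allocation with max = 1: the used/max ratio tells the GC how much
   out-of-heap memory each block stands for, and a higher ratio makes it
   run (and therefore finalize) more often. */
value object_alloc(Object *o)
{
  value v = caml_alloc_custom(&custom_block, sizeof(Object *), sizeof(Object), 1);
  Object_val(v) = o;
  return v;
}

/* Called explicitly from OCaml when the object is no longer needed;
   the GC finalizer then becomes a no-op for this block. */
CAMLprim value object_release(value v)
{
  CAMLparam1(v);
  Object *o = Object_val(v);
  if (o != NULL) {
    object_delete(o);
    Object_val(v) = NULL;
  }
  CAMLreturn(Val_unit);
}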

That max parameter of custom blocks seems to be the one causing problems in 4.07.1. The interesting thing is that we didn’t see this in 4.06.1 thanks to our changes to the GC settings: reverting the GC settings to their defaults (in 4.06.1) also shows a slowdown.

We have changed the allocation of custom blocks to:


obj = caml_alloc_custom(&custom_block, sizeof(Object*), 0, 1);

With this change we get similar performance with both versions of the compiler.

I would encourage you to experiment with memory allocator settings. We are using jemalloc and found it to improve memory performance, although we have not benchmarked it recently. It is quite easy to try by adjusting the environment:

LD_PRELOAD=/usr/lib64/libjemalloc.so.1
MALLOC_CONF=narenas:1,tcache:false,lg_dirty_mult:22

May be relevant: https://github.com/ocaml/ocaml/pull/1476

This is an old thread, but just in case someone copies this advice into their own code, we’ve made a slight tweak: CP-45703: jemalloc: avoid bottlenecks with C threads by edwintorok · Pull Request #5200 · xapi-project/xen-api · GitHub.
Using tcache:true instead of tcache:false results in only a 3% increase in memory usage, but a massive improvement in the performance of C code invoked from OCaml threads (in XAPI’s case a 17x improvement; YMMV).
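For concreteness, the earlier configuration with that tweak (keeping the other options from the post above) would read:

MALLOC_CONF=narenas:1,tcache:true,lg_dirty_mult:22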

(tcache:false wasn’t a problem previously for OCaml 4.x threads, because only one of those ran at a time anyway.)
