Factoring code results in a 500% performance hit

Hi,

I’ve been working on a file sorter for better compression in archives. It orders files by similarity; a component of the similarity computation for two files is akin to a first stage of compressing them together. I’ve probably improved the performance by a factor of 100, but all the code is in a single function, and as soon as I try to factor it, performance is hugely impacted. Is there a way to approach this without comparing assembly outputs?

The code is very imperative and I’m not sure it can be improved much more, but I need to factor the code in order to use the parallelism in OCaml 5 (the problem is embarrassingly parallel and, yes, multiple processes might work well, but where’s the fun in that?).

The relevant code is at bin/hc.ml · main · adrien / compsort · GitLab and is well commented. The toplevel README.md also explains how to run a benchmark (takes around 3 minutes and 1 gigabyte of memory on my laptop). (URI for git clone: adrien / compsort · GitLab )

Besides the refactoring problem, I’ve also tried to get “perf annotate” to work, but I’m not getting source code lines or snippets, which would be very helpful. Is it expected to work?

Excerpt (yes, around 78% of the time is spent on only two instructions!):

 37.41 │       movzbq   (%rsi,%rdx,1),%rsi
  0.53 │       lea      0x1(%rsi,%rsi,1),%rsi
  0.50 │       cmp      $0x1,%rsi
  0.06 │     ↓ jne      933
  0.04 │       lea      camlDune__exe__Hc__matcher_786,%rsi
  1.71 │       mov      (%rsi),%rsi
  0.02 │       mov      0x8(%rsi),%rsi
 40.31 │       movslq   -0x2(%rsi,%rdi,2),%rdi

Thanks!

I’ve removed some commented-out code after noticing that the indentation was wrong in GitLab’s UI. It looks like the rendering is done in blocks of 70 lines, so there are “jumps” every 70 lines; currently that’s only two jumps, and it doesn’t look /too/ weird.

It’s not clear from your post what kind of refactoring costs you so much performance, so I can’t really help you with that.

Regarding “perf annotate”, it doesn’t work for me either; I suspect that perf doesn’t like the format of debug info that ocamlopt generates, but it’s hard to know (gdb can read the debug info just fine).

Finally, the two instructions that take up most of the time aren’t the exact ones shown by perf annotate. They are the instructions the CPU was on for the most time, but it was probably stuck there because an earlier instruction hadn’t produced its output yet. In practice I expect the time is spent in the two memory loads in these lines:

        (* Don't go further if the hash has already led to a match. *)
        if BA1.unsafe_get matcher_mask k = 0 then (
          (* Lookup in files.(x) where the same hash exists. *)
          let i = Int32.to_int (BA1.unsafe_get matcher k) in

Thanks for taking a look!

You’re right, I should have provided an example. Here is one: hc: split bigarray creation and use; runs four times slower. (28a0a014) · Commits · adrien / compsort · GitLab

I hadn’t tried this exact change before. Now it looks like the cause could be generic code being used for bigarrays. I’ve peppered the new function with type annotations for the bigarrays and I think I’ve recovered the performance, but is there a better way to find out where type annotations are helpful and where they’re not needed?
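
The annotations have roughly this shape (a simplified, illustrative sketch; step is not the real function name, and, judging from the hot loop quoted above, matcher holds int32 values):

let step (matcher : (int32, Bigarray.int32_elt, Bigarray.c_layout) Bigarray.Array1.t) k =
  (* Spelling out the full bigarray type lets the compiler specialise the access. *)
  Int32.to_int (Bigarray.Array1.unsafe_get matcher k)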

I don’t even fully understand why the type of matcher is not fully known since I’m using bigstringaf in places and that should constrain the type… I guess looking at the compilation outputs in order to spot generic code could help, but that sounds sub-optimal.

As for the code, I expect that k is basically random, so matcher.(k) and matcher_mask.(k) cannot be predicted or prefetched in advance. I even tried adding a Bloom filter, but it didn’t improve performance (my tuning was likely not good, but I also ran into the same sudden performance degradation). That could explain the cost and would match your assessment.

I expect that matcher_mask improves performance because it’s smaller (a byte array instead of int64). I haven’t been able to pack things more tightly (i.e. as single bits), probably because the required bit manipulations were too expensive, but it’s tempting to try again…
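
For reference, single-bit packing would look roughly like this (an untested sketch with illustrative names, just to show the extra shifts and masks involved):

let bit_get (mask : (int, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t) k =
  (* Byte k/8, bit k mod 8. *)
  Bigarray.Array1.unsafe_get mask (k lsr 3) land (1 lsl (k land 7)) <> 0

let bit_set (mask : (int, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t) k =
  let byte = Bigarray.Array1.unsafe_get mask (k lsr 3) in
  Bigarray.Array1.unsafe_set mask (k lsr 3) (byte lor (1 lsl (k land 7)))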

Bigarrays are known to be problematic, indeed.
The issue is that at the type-checking phase, the compiler will infer the most generic type it can, but bigarray accesses only get compiled efficiently when the type is sufficiently constrained.
Adding type annotations can help by making sure the compiler can’t infer types that are too generic; another solution (that I don’t recommend) is to define the function foo in your commit in the following way:

let foo = (fun f -> f) (fun ~files ~hashes ~file_count ~results ~matcher ~matcher_mask ~matcher_hits ->
  ...

The weird identity function in front prevents the compiler from generalising the type of the function; as a result, uses of the function add constraints to the types of its arguments, and the bigarray accesses get compiled efficiently.

But what I would actually recommend is to redefine the bigarray primitives you need with a specialised type:

let[@inline always] bigarray_get_char
    (ba : (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t)
    idx =
  Bigarray.Array1.get ba idx
(* Similar for the other types and accessors *)

This way, you don’t need to put type annotations in the code itself, and you don’t risk using non-specialised primitives by mistake.
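
For instance, an int32 accessor for your matcher array would follow the same pattern (using unsafe_get to match your hot loop):

let[@inline always] bigarray_get_int32
    (ba : (int32, Bigarray.int32_elt, Bigarray.c_layout) Bigarray.Array1.t)
    idx =
  Bigarray.Array1.unsafe_get ba idx

(* At the call site: let i = Int32.to_int (bigarray_get_int32 matcher k) in ... *)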


I’ve introduced helper functions as you suggested, as that’s much cleaner. I don’t think I’m seeing a performance degradation anymore, and I’ve actually been able to take advantage of domains to parallelize the processing, which gives me a much nicer effective speed!

I’ll probably also try switching to multi-processing afterwards, as the speedup I’m seeing is quite a bit less than linear.

Thanks again!

I should mention that the way I attempted to have bigarrays reused inside domains (one domain = one set of bigarrays) was wrong (not a surprise, but I didn’t know the API back when I wrote that).

I’ve now written a basic resource pool using Domainslib.Chan: I create everything upfront, then receive a set of resources from the pool before each processing step and send it back afterwards.
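
In outline it looks like this (simplified; num_domains, make_resources and work stand in for the real configuration, allocation and processing code):

(* A bounded channel used as a resource pool: capacity num_domains,
   filled upfront with one set of bigarrays per domain. *)
let pool = Domainslib.Chan.make_bounded num_domains

let () =
  for _ = 1 to num_domains do
    Domainslib.Chan.send pool (make_resources ())
  done

let process_one input =
  let res = Domainslib.Chan.recv pool in  (* take a set of bigarrays *)
  let out = work res input in             (* use it for one processing step *)
  Domainslib.Chan.send pool res;          (* send it back to the pool *)
  out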