Unexpected poor performance of comparison loops

Hi! While writing a few operations on slices of bytes, I ran into surprising performance numbers for a sliced equivalent of Bytes.equal, i.e. a hypothetical Bytes.equal_slice a a_pos b b_pos len:

┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ Bytes.equal 1KB      │  51.88ns │         │      9.62% │
│ Big.memcmp 1KB       │ 539.21ns │         │    100.00% │
│ Slice.equal 1KB      │ 388.13ns │         │     71.98% │
│ Slice.equal_copy 1KB │ 113.86ns │ 254.00w │     21.12% │
└──────────────────────┴──────────┴─────────┴────────────┘

The first test is simply Bytes.equal bytes_a bytes_b, which serves as my baseline. The second is Bigstringaf.memcmp bigstring_a bigstring_b. The fourth test is a naïve, allocating Bytes.equal (Bytes.sub a a_pos len) (Bytes.sub b b_pos len).
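Spelled out as a function, that copying variant (named to mirror the Slice.equal_copy row in the table) is just:

let equal_copy a a_pos b b_pos len =
  (* allocates two fresh [len]-byte buffers, then defers to Bytes.equal *)
  Bytes.equal (Bytes.sub a a_pos len) (Bytes.sub b b_pos len)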

The third test was my attempt to compare the byte slices with a loop. The number of calls to Bytes.get_* turned out to be the bottleneck, so the best I came up with reads 64 bits at a time:

source: slice_equal
let slice_equal a a_pos b b_pos len =
  let ints = len lsr 3 in           (* number of 8-byte chunks *)
  let chars = len - (ints lsl 3) in (* remaining tail bytes *)
  try
    for i = 0 to ints - 1 do
      let off = i lsl 3 in
      if not (Int64.equal
             (Bytes.get_int64_ne a (a_pos + off))
             (Bytes.get_int64_ne b (b_pos + off)))
      then raise_notrace Exit
    done;
    let tail = ints lsl 3 in
    for i = 0 to chars - 1 do
      if Bytes.get a (a_pos + tail + i) <> Bytes.get b (b_pos + tail + i)
      then raise_notrace Exit
    done;
    true
  with Exit -> false

These results leave me scratching my head with two questions:

  • Bigstringaf.memcmp is a wrapper around the C function. How can it be a whopping 10x slower than Bytes.equal, which calls the runtime’s caml_string_equal(), itself ultimately a for loop over each byte? I thought memcmp() would perform similarly since it’s the same kind of loop. (Note: I am using Musl libc, thus https://git.musl-libc.org/cgit/musl/tree/src/string/memcmp.c. I compared identical chunks to make sure the whole input was scanned in every test, so Musl’s lack of early return is not a factor.)

  • Copying the two byte slices temporarily does allocate but remains fast. Is that because the allocations are small enough to stay on the minor heap?

For fun, I increased the buffer size from 1KB to 10KB, and equal_copy did indeed show its weight:

┌───────────────────────┬─────────────┬──────────┬──────────┬────────────┐
│ Name                  │    Time/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────────────┼─────────────┼──────────┼──────────┼────────────┤
│ Bytes.equal 10KB      │    378.11ns │          │          │      3.09% │
│ Big.memcmp 10KB       │  5_282.07ns │          │          │     43.21% │
│ Slice.equal 10KB      │  3_819.56ns │          │          │     31.25% │
│ Slice.equal_copy 10KB │ 12_223.97ns │   2.56kw │    0.23w │    100.00% │
└───────────────────────┴─────────────┴──────────┴──────────┴────────────┘

…but the loop over int64 remains 10x slower than Bytes.equal, and I don’t understand why the difference is so big. (I also tried 16- and 32-bit reads, but the bottleneck is the number of read operations, so 64-bit won.)

For what it’s worth, I did these quick tests with OCaml 4.13.1 with flambda and -O3.

2 Likes

The read operations that you’re using introduce length checks. These length checks are not currently optimized away, so you pay a bunch of extra comparisons and a branch each time.
There are unsafe versions of the same primitives, but they might not be exported in the standard library (check ocplib-endian, which likely defines them all).
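For instance, the difference is roughly the following (a sketch, assuming ocplib-endian is installed; the unsafe variant skips the bounds check entirely, so the caller is responsible for validating offsets):

let read_checked b pos =
  (* stdlib primitive: checks that the 8 bytes at [pos] lie within [b] *)
  Bytes.get_int64_ne b pos

let read_unchecked b pos =
  (* ocplib-endian's unsafe variant: no bounds test, reads 8 bytes blindly *)
  EndianBytes.NativeEndian_unsafe.get_int64 b pos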

Also, I just checked and Int64.equal is implemented as compare x y = 0, which is silly and slow. Use Bytes.get_int64_ne a (a_pos + off) = Bytes.get_int64_ne b (b_pos + off) instead; you should see a noticeable boost.

Musl libc iterates over bytes, while the implementation in the OCaml runtime iterates over words, like your own implementation (see Delegate implementation of caml_string_equal to libc's memcmp by krtab · Pull Request #12427 · ocaml/ocaml · GitHub for a small relevant discussion).

3 Likes

You are spot-on, thank you for the hints!

Switching to the polymorphic comparison <> dropped the time from 3800 to 3000 ns. I keep forgetting to let the compiler optimize those into simple operations.

Switching from Bytes.get_int64_ne to EndianBytes.NativeEndian_unsafe.get_int64 got that down to 1100 ns.
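For reference, the inner loops now look roughly like this (a sketch; no bounds checks yet, see below):

let slice_equal_unsafe a a_pos b b_pos len =
  let ints = len lsr 3 in
  let chars = len - (ints lsl 3) in
  try
    for i = 0 to ints - 1 do
      let off = i lsl 3 in
      (* unsafe 64-bit reads; (<>) is specialised by the compiler since both
         operands are statically known to be int64 *)
      if EndianBytes.NativeEndian_unsafe.get_int64 a (a_pos + off)
         <> EndianBytes.NativeEndian_unsafe.get_int64 b (b_pos + off)
      then raise_notrace Exit
    done;
    let tail = ints lsl 3 in
    for i = 0 to chars - 1 do
      if Bytes.get a (a_pos + tail + i) <> Bytes.get b (b_pos + tail + i)
      then raise_notrace Exit
    done;
    true
  with Exit -> false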

It’s still 3x slower than Bytes.equal but that is more comfortable than my initial 10x. It’s also a surprising 3-5x faster than Bigstringaf.memcmp. :wink:

I’ll just need to add bounds checks in my cleaned-up version.
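Something like this up-front validation should be enough (mirroring what Bytes.sub checks), with the slice_equal_unsafe sketch above doing the actual work:

let slice_equal a a_pos b b_pos len =
  (* validate once, then let the unsafe reads run without per-read checks *)
  if len < 0
     || a_pos < 0 || a_pos > Bytes.length a - len
     || b_pos < 0 || b_pos > Bytes.length b - len
  then invalid_arg "slice_equal"
  else slice_equal_unsafe a a_pos b b_pos len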

One non-trivial difference between the pure OCaml version and the runtime comparison in caml_string_equal: the code generated by ocamlopt is not allowed to keep derived pointers, so the best we can implement is equivalent to the following C code:

/* "word" stands for the machine word type, e.g. uintptr_t */
int compare_no_derived(word* a, word* b, int len) {
    int cursor = 0;
    /* index from the base pointers each iteration: no derived pointer is kept */
    for (; cursor < len; cursor++) {
        if (a[cursor] != b[cursor]) return 0;
    }
    return 1;
}

On the other hand, the C implementation can ensure that no GC occurs during its execution, so it can advance the pointers directly.
That probably explains a significant part of the remaining gap between Bytes.equal and your version.
Also, the OCaml native backend is not as good at compiling C-like code as a modern C compiler is, which could also have a noticeable impact.

2 Likes