Hi! In my venture to create a few operations on slices of bytes, I have met surprising performance when working on a sliced equivalent to `Bytes.equal` — that is, a hypothetical `Bytes.equal_slice a a_pos b b_pos len`:
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ Bytes.equal 1KB │ 51.88ns │ │ 9.62% │
│ Big.memcmp 1KB │ 539.21ns │ │ 100.00% │
│ Slice.equal 1KB │ 388.13ns │ │ 71.98% │
│ Slice.equal_copy 1KB │ 113.86ns │ 254.00w │ 21.12% │
└──────────────────────┴──────────┴─────────┴────────────┘
The first test is simply `Bytes.equal bytes_a bytes_b`, as my baseline. The second is `Bigstringaf.memcmp bigstring_a bigstring_b`. The fourth test is a naïve and allocating `Bytes.equal (Bytes.sub a a_pos len) (Bytes.sub b b_pos len)`.
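Spelled out, that fourth variant is just the following sketch (`slice_equal_copy` is my name for it here, matching the table row):

```ocaml
(* Naive slice equality: copy both slices out, then reuse Bytes.equal.
   Allocates two fresh [len]-byte buffers on every call. *)
let slice_equal_copy a a_pos b b_pos len =
  Bytes.equal (Bytes.sub a a_pos len) (Bytes.sub b b_pos len)
```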
The third test was my attempt to compare byte slices with a loop. The number of calls to `Bytes.get_*` was the bottleneck, so my best version is:
```ocaml
let slice_equal a a_pos b b_pos len =
  let ints = len lsr 3 in
  try
    for i = 0 to ints - 1 do
      if not (Int64.equal
                (Bytes.get_int64_ne a (a_pos + (i lsl 3)))
                (Bytes.get_int64_ne b (b_pos + (i lsl 3))))
      then raise_notrace Exit
    done;
    for i = ints lsl 3 to len - 1 do
      if Bytes.get a (a_pos + i) <> Bytes.get b (b_pos + i)
      then raise_notrace Exit
    done;
    true
  with Exit -> false
```
These results leave me scratching my head with two questions:
- `Bigstringaf.memcmp` is a wrapper for the C function. How can it be a whopping 10x slower than `Bytes.equal`, which calls the runtime's `caml_string_equal()`, which itself is ultimately a `for` loop over each byte? I thought that `memcmp()` would perform similarly, since it's the same kind of loop. (Note: I am using musl libc, thus https://git.musl-libc.org/cgit/musl/tree/src/string/memcmp.c. I compared identical chunks to make sure the whole thing was scanned in all tests, so musl's lack of early return is not a factor.)
- Copying the two byte slices temporarily does allocate, but remains fast. I guess this is because the allocations are small enough to stay on the minor heap.
For fun, I increased from 1KB to 10KB, and then `equal_copy` indeed showed its weight:
┌───────────────────────┬─────────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────────────┼─────────────┼──────────┼──────────┼────────────┤
│ Bytes.equal 10KB │ 378.11ns │ │ │ 3.09% │
│ Big.memcmp 10KB │ 5_282.07ns │ │ │ 43.21% │
│ Slice.equal 10KB │ 3_819.56ns │ │ │ 31.25% │
│ Slice.equal_copy 10KB │ 12_223.97ns │ 2.56kw │ 0.23w │ 100.00% │
└───────────────────────┴─────────────┴──────────┴──────────┴────────────┘
…but the loop over `int64` remains 10x slower than `Bytes.equal`, and I don't understand why the difference is so big. (I also tried 16- and 32-bit reads, but the bottleneck is the number of read operations, so 64-bit won.)
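One variable I would still like to isolate is the `try`/`Exit` control flow itself. A self-contained, exception-free variant of the same word-at-a-time idea (a sketch; `slice_equal_rec` is a hypothetical name, and like the loop version it assumes the caller keeps `a_pos + len` and `b_pos + len` in bounds):

```ocaml
(* Compare [len] bytes of [a] (from [a_pos]) and [b] (from [b_pos]),
   8 bytes at a time, finishing byte by byte; returns a bool directly
   instead of raising Exit on the first mismatch. *)
let slice_equal_rec a a_pos b b_pos len =
  let ints = len lsr 3 in
  let rec bytes_from i =
    i >= len
    || (Bytes.get a (a_pos + i) = Bytes.get b (b_pos + i)
        && bytes_from (i + 1))
  and words i =
    if i >= ints then bytes_from (ints lsl 3)
    else
      Int64.equal
        (Bytes.get_int64_ne a (a_pos + (i lsl 3)))
        (Bytes.get_int64_ne b (b_pos + (i lsl 3)))
      && words (i + 1)
  in
  words 0
```

Both recursive calls are in tail position, so this should compile to plain loops; if it benchmarks the same as the `Exit` version, the exception path can be ruled out as a cost.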
For what it's worth, I did these quick tests with OCaml 4.13.1 with flambda and `-O3`.