Int32 optimization puzzle

I have the following code of which I don’t quite understand the performance behaviour.

This is a CRC-32 computation made with int32 OCaml integers (I would like to get rid of this int32 unboxing hack in a library, especially since it makes Adler-32 computation actually slower).

Basically on my machine (arm64, ocaml 14.1.0) the runs with the code below are done in 80ms (which is larger than when using the unboxing trick).

But if I replace in the string function the k indexing the table by say the byte and Sys.opaque identity the k, I get runs of 30ms. So something seems not to quite add up, I have looked at the assembly or the lambda code but nothing jumps to my eyes (though I’m rusty on that).

Has maybe someone a hint about what may be happening ?

cat - > /tmp/ <<EOF  
let poly = 0xedb88320l
let table =
  let init i =
    let c = ref (Int32.of_int i) in
    for k = 0 to 7 do
      c :=
        if Int32.logand !c 1l <> 0l
        then (Int32.logxor poly (Int32.shift_right_logical !c 1))
        else (Int32.shift_right_logical !c 1);
    done; !c
  Bigarray.Array1.init Bigarray.int32 Bigarray.c_layout 256 init

let string s =
  let i = ref 0 and max = String.length s - 1 in
  let c = ref 0xFFFF_FFFFl in
  while (!i <= max) do
    let ci = Int32.to_int !c in
    let byte = String.get_uint8 s !i in
    let k = (ci lxor byte) land 0xFF in
    c := Int32.logxor
        (Bigarray.Array1.get table k : int32)
        (Int32.shift_right_logical !c 8);
    incr i;
  Int32.logxor !c 0xFFFF_FFFFl

let random ~length =
  let ic = In_channel.open_bin "/dev/random" in
  let finally () = In_channel.close_noerr ic in
  Fun.protect ~finally @@ fun () ->
  In_channel.really_input_string ic length |> Option.get

let test_speed crc_of_string s =
  let start = Sys.time () in
  let crc = crc_of_string s in
  let stop = Sys.time () in
  let dur = (stop -. start) *. 1000. in
  Format.printf "%dms %lx\n" (truncate dur) crc

let main () =
  let length = (20 * 1024 * 1024) in
  let s = random ~length in
  test_speed string s;
  test_speed string s

let () = if !Sys.interactive then () else main ()
cd /tmp && ocamlopt && ./a.out

If I understand correctly, your replacement code indexes the array by just byte instead of !c lxor byte and thus no longer needs to wait for the memory load c := table[k] from the previous iteration to complete before processing further, right? Then it does not seem that surprising that your code becomes much faster, since the processor has a lot more leeway in reordering the instructions. But perhaps I am misunderstanding what changes your are making to your code.

Just to make things clear ^.

But yes I guess you are right. Somehow my mental model eschews all the crazy stuff modern processors do.

Just to wrap this up. I removed the uint32 as int trick which I was really not happy with. Notably it changes the implementation for js_of_ocaml apps depending on the bitness of the platform on which you compile your app, which is ugly.

Instead I decided to take the slowdown[1] from 2.03s to 2.10s on unzipping the silesia corpus (75Mo that uncompress to 202Mo) and focus to try to improve the CRC-32 computation as OCaml int32. That’s nevertheless worth pursuing since it eats quite a bit in unzipping archives.

To do this I turned to what the internet is to be liked for which is: people obsessing and maintaining webpages on niche subjects. Namely fast CRC-32 computation.

I simply moved one step on the performance ladder shown there and took the “slicing-by-4” approach. That lead to 1.69s[2] unzipping time and made me happy enough to simply go with that (for reference the unzip 6.00 tool on my machine takes 1.12s on that test case) .

  1. But the 1.4 faster Adler-32. ↩︎

  2. I did not try that technique with the uint32 as int ;–) ↩︎