Performance difference between bigarray inside record and plain bigarray access

Hello :wave:

today I noticed a behavior I don’t have an explanation for.
Maybe some of you can explain, why this happens. :slightly_smiling_face:

Using the following code, the Io implementation where the bigarray is packed inside a record runs segnificantly faster than the one accessed directly.
To me this is really counter intuitive.

open Bigarray

type 'a img = {width: int; height: int; image: 'a}

type data = (int32, int32_elt, c_layout) Array1.t

module type IO = sig
  type t

  val make : unit -> t img

  val read : x:int -> y:int -> t img -> Int32.t
end

module Io_direct : IO = struct
  type t = data

  let make () =
    let width = 4000 in
    let height = 4000 in
    let data = Array1.create int32 c_layout (width * height) in
    Array1.fill data 255l ;
    {width; height; image= data}

  let read ~(x : int) ~(y : int) img =
    Array1.unsafe_get img.image ((y * img.width) + x)
end

module Io_record : IO = struct
  type t = {data: data}

  let make () =
    let width = 4000 in
    let height = 4000 in
    let data = Array1.create int32 c_layout (width * height) in
    Array1.fill data 255l ;
    {width; height; image= {data}}

  let read ~(x : int) ~(y : int) img =
    Array1.unsafe_get img.image.data ((y * img.width) + x)
end

let run name image read =
  let start = Unix.gettimeofday () in
  let () =
    for y = 0 to image.height - 1 do
      for x = 0 to image.width - 1 do
        let _ = read ~x ~y image in
        ()
      done
    done
  in
  let ms = (Unix.gettimeofday () -. start) *. 1000. in
  Printf.printf "%s io took: %f ms \n" name ms

let () =
  let image = Io_direct.make () in
  run "direct" image Io_direct.read

let () =
  let image = Io_record.make () in
  run "record" image Io_record.read

The output it generates is

direct io took: 445.538998 ms 
record io took: 66.195965 ms 

Can you explain why the “record io” is so much faster?
Is there some optimization which only runs on records?

Thank you! :blush:

1 Like

Wrap a value into a record adds an indirection. That mean that if you want an access to the underlying bigarray, an extra operation is needed (to dereference the pointer inside the record and finally load the bigarray).

You can see such indirection in assembler for this snippet for instance:

type t = { v : int }

let f0 t = t.v
let f1 v = v

And in assembler (x86_64):

camlExample__f0_8:
        movq    (%rax), %rax
        ret
camlExample__f1_15:
        ret

However, it exists a way to “optimize” your record. Indeed, in some situation, you need to wrap a value into a record, specially when you want the universal quantification (from the type system). In such case, you can use [@@unboxed] - be aware that it changes the caml representation then (so you need to be aware about such detail when you want to do some FFI stuffs).

type t = { v : int } [@@unboxed]

let f0 t = t.v

And in assembler:

camlExample__f0_8:
        ret

Thanks for your reply, but I think you misread the question

The one wrapped inside a record is faster. I would also think that accessing the value inside the record and then accessing the bigarray would be slower.
Thats why it is so counter intuitive to me :grinning_face_with_smiling_eyes:

Ah yes, interesting case so :smiley: you give me a rabbit hole!

So I take a look on the assembly generated and it seems that OCaml do something smart when it manipulate a record. A simple call to Array.unsafe_get emits a call to caml_ba_get_1 as expected, However, with a record, the assembly is a bit different when, in this path, we know that we manipulate an int32.

I think it’s mostly due to the int32 and the combination of:

And:

In other words, you should need to “specialize” Io_direct.read and its argument img to enforce the fact that it is a data - because Array1.unsafe_get does not constraint it to be a bigarray of int32. By this way, you can let the opportunity to OCaml to specialize the way to load a data from the given bigarray instead of a simple call to caml_ba_get_1 (which is “polymrophic”).

2 Likes

I would double-check on godbolt.org/ but my first intuition, based on
recent experiences, is that the record forces a monomorphic type on the
bigarray. Accessing a monomorphic bigarray is just a few assembly
instructions; a polymorphic bigarray will require a C call, I think.

4 Likes

Indeed, the read in Io_direct should fix the type of the bigarray (as given above, it has type x:int -> y:int -> ('a, 'b, 'c) Stdlib__bigarray.Array1.t img -> 'a, implying as @c-cube said, that the code generated goes through a generic C function). One way to fix the type is:

  let read ~(x : int) ~(y : int) img =
    Array1.unsafe_get (img.image: t) ((y * img.width) + x)

The funny thing is that this code is still a couple of ms slower. To make it faster one must “fetch” the bigarray first:

  let read ~(x : int) ~(y : int) img =
    let i : t = img.image in
    Array1.unsafe_get i ((y * img.width) + x)
5 Likes

Wow! I wanted to say thank you. I have a performance sensitive program in OCaml using big array and adding a type annotation sped up the code ~30%! Is there a list of these little hints for optimization anywhere?

Essentially the lambda primitives which are polymorphic, like compare, =, Array.get and more. An extensive list of them can be found here

2 Likes