today I noticed a behavior I don’t have an explanation for.
Maybe some of you can explain, why this happens.
Using the following code, the Io implementation where the bigarray is packed inside a record runs segnificantly faster than the one accessed directly.
To me this is really counter intuitive.
open Bigarray
type 'a img = {width: int; height: int; image: 'a}
type data = (int32, int32_elt, c_layout) Array1.t
module type IO = sig
type t
val make : unit -> t img
val read : x:int -> y:int -> t img -> Int32.t
end
module Io_direct : IO = struct
type t = data
let make () =
let width = 4000 in
let height = 4000 in
let data = Array1.create int32 c_layout (width * height) in
Array1.fill data 255l ;
{width; height; image= data}
let read ~(x : int) ~(y : int) img =
Array1.unsafe_get img.image ((y * img.width) + x)
end
module Io_record : IO = struct
type t = {data: data}
let make () =
let width = 4000 in
let height = 4000 in
let data = Array1.create int32 c_layout (width * height) in
Array1.fill data 255l ;
{width; height; image= {data}}
let read ~(x : int) ~(y : int) img =
Array1.unsafe_get img.image.data ((y * img.width) + x)
end
let run name image read =
let start = Unix.gettimeofday () in
let () =
for y = 0 to image.height - 1 do
for x = 0 to image.width - 1 do
let _ = read ~x ~y image in
()
done
done
in
let ms = (Unix.gettimeofday () -. start) *. 1000. in
Printf.printf "%s io took: %f ms \n" name ms
let () =
let image = Io_direct.make () in
run "direct" image Io_direct.read
let () =
let image = Io_record.make () in
run "record" image Io_record.read
The output it generates is
direct io took: 445.538998 ms
record io took: 66.195965 ms
Can you explain why the “record io” is so much faster?
Is there some optimization which only runs on records?
Wrap a value into a record adds an indirection. That mean that if you want an access to the underlying bigarray, an extra operation is needed (to dereference the pointer inside the record and finally load the bigarray).
You can see such indirection in assembler for this snippet for instance:
type t = { v : int }
let f0 t = t.v
let f1 v = v
And in assembler (x86_64):
camlExample__f0_8:
movq (%rax), %rax
ret
camlExample__f1_15:
ret
However, it exists a way to “optimize” your record. Indeed, in some situation, you need to wrap a value into a record, specially when you want the universal quantification (from the type system). In such case, you can use [@@unboxed] - be aware that it changes the caml representation then (so you need to be aware about such detail when you want to do some FFI stuffs).
Thanks for your reply, but I think you misread the question
The one wrapped inside a record is faster. I would also think that accessing the value inside the record and then accessing the bigarray would be slower.
Thats why it is so counter intuitive to me
So I take a look on the assembly generated and it seems that OCaml do something smart when it manipulate a record. A simple call to Array.unsafe_get emits a call to caml_ba_get_1 as expected, However, with a record, the assembly is a bit different when, in this path, we know that we manipulate an int32.
I think it’s mostly due to the int32 and the combination of:
And:
In other words, you should need to “specialize” Io_direct.read and its argument img to enforce the fact that it is a data - because Array1.unsafe_get does not constraint it to be a bigarray of int32. By this way, you can let the opportunity to OCaml to specialize the way to load a data from the given bigarray instead of a simple call to caml_ba_get_1 (which is “polymrophic”).
I would double-check on godbolt.org/ but my first intuition, based on
recent experiences, is that the record forces a monomorphic type on the
bigarray. Accessing a monomorphic bigarray is just a few assembly
instructions; a polymorphic bigarray will require a C call, I think.
Indeed, the read in Io_direct should fix the type of the bigarray (as given above, it has type x:int -> y:int -> ('a, 'b, 'c) Stdlib__bigarray.Array1.t img -> 'a, implying as @c-cube said, that the code generated goes through a generic C function). One way to fix the type is:
Wow! I wanted to say thank you. I have a performance sensitive program in OCaml using big array and adding a type annotation sped up the code ~30%! Is there a list of these little hints for optimization anywhere?