Thanks for the code!
I’ve investigated a bit and I think most (if not all) of the performance hit is due to the use of Int32.t arrays. With the module type annotations, the compiler can’t know the exact type of the array elements, so it generates generic array operations, which have to check at runtime whether the array is a float array. If you remove the interfaces and use module Int32 = Unboxed_int32, it generates integer array operations instead, which are more efficient. Using a compiler configured with -no-flat-float-array helps a bit, but there remains a noticeable difference between the versions with and without the type annotations.
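To make the difference concrete, here is a minimal sketch (the names Abstract_int, sum_abstract and sum_plain are made up for illustration, they are not from your code). Both functions compute the same sum, but in the first one the element type is abstract at the point of the array read, so the compiler has to fall back to the generic access, while in the second one it is known to be int:

```ocaml
(* Hypothetical names, for illustration only. *)
module Abstract_int : sig
  type t
  val zero : t
  val of_int : int -> t
  val to_int : t -> int
  val add : t -> t -> t
end = struct
  type t = int
  let zero = 0
  let of_int x = x
  let to_int x = x
  let add = Int.add
end

(* The element type is abstract here, so [arr.(i)] is a generic array read. *)
let sum_abstract (arr : Abstract_int.t array) =
  let acc = ref Abstract_int.zero in
  for i = 0 to Array.length arr - 1 do
    acc := Abstract_int.add !acc arr.(i)
  done;
  Abstract_int.to_int !acc

(* The element type is known to be [int], so [arr.(i)] is an integer array read. *)
let sum_plain (arr : int array) =
  let acc = ref 0 in
  for i = 0 to Array.length arr - 1 do
    acc := !acc + arr.(i)
  done;
  !acc

let abstract_arr = Array.init 1000 Abstract_int.of_int
let plain_arr = Array.init 1000 (fun i -> i)
```

If you compile this with -dlambda, the two reads should show up with different array kinds (generic versus int), although the exact printing depends on the compiler version.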
I also checked whether the first-class modules and the dispatch on Sys.word_size made any difference, and they do not seem to introduce any overhead: both Closure and Flambda have the same performance profile whether the first-class modules are used or not (as long as Unboxed_int32 is defined with a module type annotation). Flambda is about 20-25% faster than Closure, but that is likely only due to its more aggressive inlining in the rest of the code.
Overall, that was an interesting problem. I wouldn’t have thought about the impact on arrays from the initial example only, so thanks again for sharing the code.
The next step to try to recover performance could be to define an Array submodule in the Int32_intf module type with all the array operations you need, and make sure in the implementations to annotate the arguments to force the use of the specialised primitives:
module type Int32_intf = sig
  type t

  val logand : t -> t -> t
  val take_lower_32_bits : t -> t
  val add : t -> t -> t

  (* Array operations specialised to [t] *)
  module Array : sig
    val make : int -> t -> t array
    val get : t array -> int -> t
  end
end

module Unboxed_int32 : Int32_intf = struct
  type t = int

  let raise_2_to_32 = 1 lsl 32 (* Equivalent to [Int.of_float (2. ** 32.)] *)
  let raise_2_to_32_minus_1 = pred raise_2_to_32
  let logand a b = Int.logand a b
  let take_lower_32_bits x = logand x raise_2_to_32_minus_1
  let add a b = Int.add a b

  module Array = struct
    (* Inside the implementation [t = int] is known, so the annotations
       let the compiler pick the integer array primitives. *)
    let make size elt = Array.make size (elt : t)
    let get arr idx = Array.get (arr : t array) idx
  end
end

(* [boxed_int32_module] and [unboxed_int32_module] are assumed to be the
   first-class modules of type [(module Int32_intf)] from your code. *)
module Int32 =
  (val match Sys.word_size with
       | 32 -> boxed_int32_module
       | 64 -> unboxed_int32_module
       | bits -> failwith (Printf.sprintf "Sys.word_size: unsupported: %d" bits))

module Array = Int32.Array (* hack: overriding the Array module allows the
                              a.(i) syntax to work directly *)
I haven’t tried it, but it should get you the same performance as using the unannotated Unboxed_int32 module directly.
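For reference, here is a small self-contained sketch of the Array override trick on its own. Int_array, elt and roundtrip are made-up names standing in for Int32.Array and its users, and I've assumed you would also want a set so that a.(i) <- v works. The a.(i) syntax is just sugar for Array.get a i (and a.(i) <- v for Array.set a i v), resolved against whatever Array module is in scope:

```ocaml
(* Hypothetical module, standing in for [Int32.Array]. *)
module Int_array : sig
  type elt
  val of_int : int -> elt
  val to_int : elt -> int
  val make : int -> elt -> elt array
  val get : elt array -> int -> elt
  val set : elt array -> int -> elt -> unit
end = struct
  type elt = int
  let of_int x = x
  let to_int x = x
  (* Annotations force the element type to [int] at each primitive. *)
  let make size e = Array.make size (e : elt)
  let get arr idx = Array.get (arr : elt array) idx
  let set arr idx v = Array.set (arr : elt array) idx (v : elt)
end

(* From here on, [a.(i)] and [a.(i) <- v] go through [Int_array]. *)
module Array = Int_array

let roundtrip () =
  let a = Array.make 4 (Int_array.of_int 0) in
  a.(2) <- Int_array.of_int 42;
  Int_array.to_int a.(2)
```

One thing to keep in mind is that the override shadows the standard Array module for the rest of the file, so anything else you need from it has to be reached through Stdlib.Array.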
If you do try it, please let me know whether it works!