Unboxed Int32.t on 64-bit?

Is there a module somewhere that provides 32-bit integer semantics but backed by 64-bit underlying ints so it’s fast? I don’t mind if it falls back to the slower heap-based standard Int32 on 32-bit platforms.

Some cryptographic algorithms depend on 32-bit register semantics and it would be nice to be able to use pure OCaml versions that are also fast. Optint does something similar but not exactly this.

If you do not care about 32-bit systems you can easily write a module UnboxedInt32 with the same signature as Int32 but implemented using regular ints (reducing all operations modulo 32 bits).
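
For instance, a minimal sketch along those lines (module and operation names are mine, and values are kept unsigned in the low 32 bits, so signed operations such as arithmetic shifts or division would need extra care):

module UnboxedInt32 = struct
  type t = int

  let mask = (1 lsl 32) - 1 (* 0xFFFFFFFF; assumes a 64-bit native int *)
  let of_int n = n land mask
  let add a b = (a + b) land mask

  (* Works because 2^32 divides 2^63: wrap-around of the native
     multiplication does not disturb the low 32 bits. *)
  let mul a b = (a * b) land mask

  let logxor a b = a lxor b (* already within 32 bits by invariant *)
  let shift_left a n = (a lsl n) land mask
  let shift_right_logical a n = a lsr n (* safe: upper bits are zero *)
end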

If you need to support 32-bit systems, you can do the same thing but use first-class modules to choose between UnboxedInt32 and the regular Int32 depending on the bitness of the system on which you are running.

Cheers,
Nicolas

Not sure I understand your real use case, but optint was made to use a 63-bit integer (unboxed) or a 32-bit integer depending on the platform, with the same interface as Int32 (plus some extra functions).

Its purpose is to allow the use of unboxed integers where possible for adler32 or crc32 checksums (see checkseum).

Also, the Int63 provided by @CraigFe is young, but it is already used by Irmin to, again, take advantage of unboxed integers on 64-bit platforms.

Keep in mind that ocamlopt is pretty good at keeping int32 (and int64 on 64-bit platforms) unboxed within a function. A slightly contrived function like the sketch below, for example, reads the parameter from memory once, does all the arithmetic in registers, and then allocates only to return the new int32.
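
For instance (an illustrative sketch, with arbitrary constants):

(* All the intermediate int32 arithmetic stays in registers; the only
   allocation is the boxing of the final result. *)
let mix (x : int32) : int32 =
  let open Int32 in
  let a = add x 0x9E3779B9l in
  let b = logxor a (shift_left a 13) in
  add b (shift_right_logical b 7)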

Furthermore, you can:

  • Control inlining on a caller or callee basis with annotations: OCaml - Optimisation with Flambda
  • Use Bytes.{set,get}_int32_le to store multiple int32s unboxed in bytes (variants without bounds checks are available as externals)
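
For example, a small sketch using those stdlib functions (the buffer layout is arbitrary):

(* Pack four int32 values into a bytes buffer, then read one back;
   the elements live unboxed inside the buffer. *)
let demo () =
  let buf = Bytes.create 16 in
  List.iteri (fun i v -> Bytes.set_int32_le buf (4 * i) v) [ 1l; 2l; 3l; 4l ];
  Bytes.get_int32_le buf 8 (* = 3l *)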

Hmm. I took a crack at what @nojb suggested (thanks!)

A naive, straightforward implementation using Int reduced modulo 2^32 ended up being 4x slower than using Int32, though that was entirely due to the slowdown from tacking a mod 2^32 onto every primitive operation.

Fortunately there’s a bitwise trick that works for powers of 2, via the equivalence n % m == n & (m - 1). Doing this yields a 20% speedup over Int32, even with the modulus, in this application.
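
In OCaml terms, that looks like this (a sketch; the names are mine):

(* Sanity check of the identity for m = 2^32 (needs a 64-bit int): *)
let mask = (1 lsl 32) - 1
let () = assert (0x1_2345_6789 mod (1 lsl 32) = 0x1_2345_6789 land mask)

(* So a masked add compiles to an add and an and, with no division: *)
let add32 a b = (a + b) land mask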

I’m surprised it’s only 20% over Int32.t. Probably ocamlopt picked most of the low-hanging fruit already, as @copy alludes to :stuck_out_tongue:

When I want to store int32 unboxed, I start with:

module Imm32 = struct
  include Sys.Immediate64.Make (Int) (Int32)

  let of_int32 : int32 -> t =
    match repr with
    | Immediate -> Int32.to_int
    | Non_immediate -> Fun.id

  let to_int32 : t -> int32 =
    match repr with
    | Immediate -> Int32.of_int
    | Non_immediate -> Fun.id
end

and just operate with int32, casting back and forth to Imm32.t at boundaries (fields, non-inlined function arguments/returns). As long as you ensure of_int32/to_int32 calls are inlined, I expect this to be more “compiler-friendly” than masking every intermediate operation.

Maybe I should consider a pair of prefix operators to cast back and forth, for ergonomics.

Note that for some operations (eg bitwise right shifts, division) you may get a different result depending on whether or not you reduce every intermediate quantity.
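
A concrete case of the right-shift pitfall (illustration mine):

(* The int32 value -1l, sign-extended into a native int, gives the
   wrong answer under a logical shift unless it is reduced first. *)
let x = Int32.to_int (-1l) (* upper bits all set *)
let bad = x lsr 1 (* a huge positive number *)
let good = (x land 0xFFFF_FFFF) lsr 1 (* = 0x7FFF_FFFF *)

let () =
  assert (good = Int32.to_int (Int32.shift_right_logical (-1l) 1));
  assert (bad <> good)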

Similarly, if you use Immediate64 then you may get different results depending on whether you are in the Int case or the Int32 case, since Int32 has “reduce every intermediate quantity” semantics but Int does not.

Cheers,
Nicolas

I don’t think you understood my suggestion. I write all (intermediate) operations over int32 and simply use Imm32.t for fields and function boundaries. That is:

let fma x y z =
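  (* Int32_infix is assumed to provide ( + ) and ( * ) specialised to int32. *)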
  let open Int32_infix in
  Imm32.of_int32 ((Imm32.to_int32 x * Imm32.to_int32 y) + Imm32.to_int32 z)

Doing this actually introduces a performance hit:

module type Int32_intf = sig
  type t

  val logand : t -> t -> t
  val take_lower_32_bits : t -> t
  val add : t -> t -> t
  val shift_right_logical : t -> int -> t
  val shift_left : t -> int -> t
  val logor : t -> t -> t
  val logxor : t -> t -> t
  val lognot : t -> t
  val of_int : int -> t
  val to_stdlib_int32 : t -> int32
  val of_char : char -> t
end

module Boxed_int32 : Int32_intf = struct
  include Int32

  let take_lower_32_bits x = x
  let to_stdlib_int32 x = x
  let of_char c = of_int (Char.code c)
end

module Unboxed_int32 : Int32_intf = struct
  (* Provides unboxed 32-bit operations on 64-bit platforms.
     Must apply [take_lower_32_bits x] before final stores or bitwise
     shifts, to avoid pulling in bits from the upper 32 bits.

     Fast bitwise modulus for powers of 2: n % m == n & (m - 1) *)
  type t = int

  let raise_2_to_32 = Int.of_float (2. ** 32.)
  let raise_2_to_32_minus_1 = pred raise_2_to_32
  let logand a b = Int.logand a b
  let take_lower_32_bits x = logand x raise_2_to_32_minus_1
  let add a b = Int.add a b
  let shift_right_logical a b = Int.shift_right_logical a b
  let shift_left a b = Int.shift_left a b
  let logor a b = Int.logor a b
  let logxor a b = Int.logxor a b
  let lognot x = Int.lognot x
  let of_int x = take_lower_32_bits x
  let to_stdlib_int32 x = Stdlib.Int32.of_int x
  let of_char c = Char.code c
end

let unboxed_int32_module = (module Unboxed_int32 : Int32_intf)
let boxed_int32_module = (module Boxed_int32 : Int32_intf)

module Int32 =
  (val match Sys.word_size with
       | 32 -> boxed_int32_module
       | 64 -> unboxed_int32_module
       | bits -> failwith (Printf.sprintf "Sys.word_size: unsupported: %d" bits))

I’m not sure, but I think it’s because it can’t inline through the module interface?

If I just drop the Int32_intf module type annotations and say module Int32 = Boxed_int32 directly, it’s about 30% faster.

Indeed, module type coercion is a runtime operation (essentially a record copy) and inlining will not work well there (maybe Flambda can do better, I don’t know). In any case, a better solution is probably to follow @debugnik’s suggestion and use Sys.Immediate64.Make(Int)(Int32) and build your module out of that; it was designed for this use case, after all.

Just to spell out how it works, suppose you define

module BoxedUnboxedInt32 : sig
  type t
  val of_int32 : int32 -> t
  val to_int32 : t -> int32
  val add : t -> t -> t
end = struct
  include Sys.Immediate64.Make(Int)(Int32)
  let of_int32 (n : int32) : t =
    match repr with Immediate -> Int32.to_int n | Non_immediate -> n
  let to_int32 (a : t) : int32 =
    match repr with Non_immediate -> a | Immediate -> Int32.of_int a
  let add (a : t) (b : t) : t =
    match repr with Immediate -> a + b | Non_immediate -> Int32.add a b
end

Each time you pattern match on repr (a value defined in the result of the Sys.Immediate64.Make functor), you refine the type t defined by the same functor into either Int.t = int or Int32.t = int32. I believe that if you use the native-code compiler you should see almost no performance penalty in the resulting code.
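
For example (a usage sketch):

let n : int32 =
  BoxedUnboxedInt32.(to_int32 (add (of_int32 1l) (of_int32 2l)))

(* n = 3l on both 32-bit and 64-bit platforms. *)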

Cheers,
Nicolas

Doesn’t BoxedUnboxedInt32.add have slightly different overflow semantics from Int32.add?

Not if the only operation you expose is add :slight_smile: (as the value will be reduced when passing through to_int32). But yes, in general you have to be careful to either reduce every intermediate result modulo 32 bits, or at least do so before it is used in an operation where this makes a difference (eg right shifts).

Cheers,
Nicolas

Note that if you go that route (using a different module depending on Sys.word_size) you can use dune to link one implementation or the other. Then there’s nothing to inline :slight_smile:

For example:

(rule
 (targets int32.ml)
 (enabled_if (= %{arch_sixtyfour} true))
 (action (copy# impl/unboxed_int32.ml int32.ml)))

(rule
 (targets int32.ml)
 (enabled_if (= %{arch_sixtyfour} false))
 (action (copy# impl/boxed_int32.ml int32.ml)))

I’m curious to find out why the performance hit occurs. Do you think you could share a sample microbenchmark that gets a performance hit, and the steps you use to compile it (in particular, is it all in a single file or do you compile the int32 lib and the test code separately)?

I’m reasonably confident that Flambda will compile this code efficiently, but I’m a bit surprised that Closure (the non-Flambda version) cannot.

Note that int_size is a runtime property. When producing bytecode, int_size will depend on the ocamlrun interpreter. A similar thing happens when using js_of_ocaml-compiler. Relying on %{arch_sixtyfour} might not always work.

Ah, I didn’t know that! Thanks!

Sure, here you go!

I’m using 4.08.1+flambda. When I build and run with:

dune build --profile=release test/test_sha256.exe
time test/test_sha256.exe < /really/big/file

I get one time. If I then go into mycrypto/sha256.ml, eliminate the first-class module and interfaces (including the module type annotations), and just say module Int32 = Unboxed_int32, it runs in about 0.7x the time.

Thanks for the code!

I’ve investigated a bit, and I think most (if not all) of the performance hit is due to the use of Int32.t arrays. With the module type annotations, the compiler can’t know the exact type of the elements of the arrays, so it generates generic array operations; if you remove the interfaces and use module Int32 = Unboxed_int32, it generates integer array operations, which are more efficient. Using a compiler configured with -no-flat-float-array helps a bit, but there remains a noticeable difference between the versions with and without the type annotations.
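
To illustrate the difference (the function names are mine):

(* Abstract element type: a.(0) compiles to the generic array read,
   which must test at runtime for the flat float representation. *)
let first_generic : 'a array -> 'a = fun a -> a.(0)

(* Element type known to be int: a.(0) compiles to a direct load
   (plus the bounds check). *)
let first_int : int array -> int = fun a -> a.(0)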

I also checked whether the first-class modules and the dispatch on Sys.word_size made any difference, and they do not seem to introduce any overhead: both Closure and Flambda have the same performance profile whether or not the first-class modules are used (as long as Unboxed_int32 is defined with a module type annotation). Flambda is about 20~25% faster than Closure, but that is likely due only to the more aggressive inlining in the rest of the code.

Overall, that was an interesting problem. I wouldn’t have thought about the impact on arrays from the initial example only, so thanks again for sharing the code.

The next step to try to recover performance could be to define an Array submodule in the Int32_intf module type with all the array operations you need, and make sure in the implementations to annotate the arguments to force the use of the specialised primitives:

module type Int32_intf = sig
  type t

  val logand : t -> t -> t
  val take_lower_32_bits : t -> t
  val add : t -> t -> t
  ...
  module Array : sig
    val make : int -> t -> t array
    val get : t array -> int -> t
    ...
  end
end

module Unboxed_int32 : Int32_intf = struct
  type t = int

  let raise_2_to_32 = 1 lsl 32 (* Equivalent to [Int.of_float (2. ** 32.)] *)
  let raise_2_to_32_minus_1 = pred raise_2_to_32
  let logand a b = Int.logand a b
  let take_lower_32_bits x = logand x raise_2_to_32_minus_1
  let add a b = Int.add a b
  ...
  module Array = struct
    let make size elt = Array.make size (elt : t)
    let get arr idx = Array.get (arr : t array) idx
    ...
  end
end

module Int32 =
  (val match Sys.word_size with
     | 32 -> boxed_int32_module
     | 64 -> unboxed_int32_module
     | bits -> failwith (Printf.sprintf "Sys.word_size: unsupported: %d" bits))

module Array = Int32.Array (* hack: overriding the Array module allows the
                              a.(i) syntax to work directly *)

I haven’t tried it, but it should get you the same performance as using the unannotated Unboxed_int32 module directly.
If you actually try it, please let me know if it works!

Implemented in this changeset as you suggested.

This seems to retain the good performance but also allows runtime selection of the module, with a safe fallback on 32-bit. I kinda don’t believe it? :slight_smile: In my mental model, using first-class modules this way should introduce at least a function-pointer dereference around every call? And there are a lot of these calls? Hmm.

Anyway, I almost certainly never would have found this generic versus specific integer array thing on my own. Thanks for the insightful analysis!

As it stands, my implementation of SHA256 is about 2.4x slower than the meticulously hand-optimized C version that ships in Linux coreutils as /usr/bin/sha256sum.

I expect that to gain a further speedup I’d have to hand-unroll the loop and avoid the register shuffling, as the C version does. I’ll probably skip trying to do that. Only 2.4x slower than hand-optimized C is pretty good.

Nice to know that it actually worked! The reason there’s no extra dereference is that Sys.word_size is a compile-time constant in the native backend. So the code match Sys.word_size with ... | 64 -> unboxed_int32_module | ... simplifies to unboxed_int32_module. Also, in the backend all modules (regular ones and first-class ones) are handled in exactly the same way, so this ends up as if you had written module Int32 = Unboxed_int32. In particular, all Int32.foo functions can be uniquely resolved, which allows direct calls and inlining.

If you have time to spare, you might want to check the ocplib-endian library (it’s on opam). It exposes compiler primitives to read and write numbers from raw strings (and bytes). It’s possible that encoding 32-bit integer arrays using strings or bytes could help you shave a bit more time.
But only 2.4x slower is already nice :slight_smile:
