[ANN] carray.0.0.1

Hi everyone,

I’m glad to announce the first (experimental) release of ocaml-carray, a library mimicking the Array interface to work with contiguous C arrays.
Disclaimer: this first version is experimental and must be used with caution. Only a restricted set of values is supported at the moment (custom blocks with no out-of-heap values). Depending on demand, more value types might be supported.
Feel free to use this thread to suggest ideas, share opinions, etc.

Repository: Danny Willems / ocaml-carray · GitLab
License: MIT
Release: 0.0.1
Documentation: https://dannywillems.gitlab.io/ocaml-carray/carray/index.html
Nomadic Labs website: https://nomadic-labs.com
Tezos ZK-rollups repository: Nomadic Labs / privacy-team · GitLab

Motivation

OCaml arrays are not always contiguous pieces of memory: an array of boxed values stores pointers, so the elements themselves live in different chunks of memory and accessing them means jumping around the heap. When a value is needed, the CPU does not fetch only that particular value from RAM; it loads a whole contiguous block of memory around it into its cache, and then loads values into its registers from the cache. This is inefficient for large OCaml arrays, as the CPU constantly goes back to RAM to load different regions of memory into its cache.
Also, when using the C FFI, the user must know the memory representation of an array and use the low-level, not particularly user-friendly Field macro.

This work

This library provides a polymorphic interface mimicking a subset of the Array interface to work with a contiguous piece of memory. Using the library should be as easy as adding module Array = Carray.
A C macro, Carray_val, is also provided for developers writing bindings; it simply casts to the underlying C type.
It has also been observed that sub-arrays are sometimes used for read-only operations, yet Array.sub allocates a fresh copy of the requested part. Carray avoids this memory cost by providing noalloc variants, such as sub_noalloc.
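
To make that concrete, here is a minimal sketch of the intended usage. It assumes Carray exposes Array-style make, set, get and length (the documentation is a copy of Array.mli) and uses Int64.t, a custom block, purely as an illustrative element type:

module Array = Carray

let () =
  (* Allocate a contiguous out-of-heap array of 1_000 custom blocks. *)
  let a = Array.make 1_000 Int64.zero in
  for i = 0 to Array.length a - 1 do
    Array.set a i (Int64.of_int i)
  done;
  (* sub_noalloc returns a view on the same memory instead of a fresh copy. *)
  let first_ten = Carray.sub_noalloc a 0 10 in
  Printf.printf "%d %Ld\n" (Array.length first_ten) (Array.get first_ten 9)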

Performance

The concept has been tested and used in real-world applications, like the polynomial library used by Nomadic Labs to implement ZK-rollups. A speed-up of around 50% has been observed when using contiguous arrays instead of OCaml arrays to compute NTT/FFT.

Usage

This library is experimental; use it with caution. The interface might change in the future.

opam install carray.0.0.1

Could you briefly tell us what you did differently from Bigarray? (That one has horrible perf, at least when I tried using it.)
Also, TIL that OCaml may split an array into non-contiguous chunks.

FYI Bigarray has very good performance but one must be aware of its characteristics. Allocating and deallocating the array is relatively expensive, but it is not scanned by the GC (good for large arrays) and computation within the array happens in C code and can therefore be more efficient, for example by using vector instructions.
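
As a rough illustration of that trade-off, here is a small self-contained sketch that only uses Bigarray from the OCaml side: allocation is comparatively expensive, but the buffer is a single flat block that the GC never scans.

let sum_of_squares n =
  let open Bigarray in
  (* Allocating a Bigarray is relatively costly... *)
  let a = Array1.create float64 c_layout n in
  for i = 0 to n - 1 do
    Array1.unsafe_set a i (float_of_int i)
  done;
  (* ...but traversals touch one flat buffer that the GC ignores. *)
  let acc = ref 0.0 in
  for i = 0 to n - 1 do
    let x = Array1.unsafe_get a i in
    acc := !acc +. (x *. x)
  done;
  !acc

let () = Printf.printf "%f\n" (sum_of_squares 1_000_000)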

I’d love to see an example of Bigarray used for OCaml-side perf rather than FFI or memory mapping, so that I can better understand how to use it properly.
From what I understand, C calls are usually more expensive than OCaml calls… And AFAIK ocamlrun is compiled with -O2 and no -march/tune etc… so I don’t think cc would emit any vectorized instructions for the runtime… Or does it?


The main example I can point to is Owl, which is a NumPy-like library for OCaml. If you want to do any kind of vector math, image processing, scientific computing, deep learning, or heavy floating-point computation, you want to use Owl, as it will use BLAS and LAPACKE vectorized operations over Bigarrays.
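
For example (a sketch assuming Owl's Mat module with uniform, dot and sum'; requires the owl opam package), a matrix product like the following is dispatched to BLAS over Bigarray storage:

let () =
  let a = Owl.Mat.uniform 500 500 in
  let b = Owl.Mat.uniform 500 500 in
  (* dot calls into BLAS (gemm) on the underlying Bigarrays. *)
  let c = Owl.Mat.dot a b in
  Printf.printf "sum of entries: %f\n" (Owl.Mat.sum' c)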


Note that you can, to a certain degree, build your own flat structures with the Bytes module. Compared to bigarrays, Bytes.t has less indirection, a lower constant memory overhead and can be allocated on the minor heap. The contents of Bytes.t are not scanned by the GC, just like bigarrays.

For example, a more efficient int32 Array.t:

module Int32_array : sig
  type t
  val equal : t -> t -> bool
  val create : int -> t
  val length : t -> int
  val get : t -> int -> int32
  val set : t -> int -> int32 -> unit
  val sub : t -> int -> int -> t
  val to_list : t -> int32 list
end = struct
  type t = Bytes.t
  let equal = Bytes.equal
  let create len = Bytes.create (4 * len)
  let length t = Bytes.length t / 4
  let get t i = Bytes.get_int32_le t (4 * i)
  let set t i x = Bytes.set_int32_le t (4 * i) x
  let sub t pos len = Bytes.sub t (4 * pos) (4 * len)
  let to_list t = List.init (length t) (get t)
end
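
A quick usage sketch of the module above:

let () =
  let a = Int32_array.create 4 in
  for i = 0 to Int32_array.length a - 1 do
    Int32_array.set a i (Int32.of_int (i * i))
  done;
  assert (Int32_array.get a 3 = 9l);
  (* sub copies, like Array.sub; to_list then reads from the copy. *)
  assert (Int32_array.to_list (Int32_array.sub a 1 2) = [1l; 4l])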

A more efficient (int * int):

module Point : sig
  type t
  val create : int -> int -> t
  val x : t -> int
  val y : t -> int
end = struct
  external get_int64_unsafe : bytes -> int -> int64 = "%caml_bytes_get64u"
  external set_int64_unsafe : bytes -> int -> int64 -> unit = "%caml_bytes_set64u"
  type t = Bytes.t
  let create x y =
    let p = Bytes.create 16 in 
    set_int64_unsafe p 0 (Int64.of_int x);
    set_int64_unsafe p 8 (Int64.of_int y);
    p
  let x t = Int64.to_int (get_int64_unsafe t 0)
  let y t = Int64.to_int (get_int64_unsafe t 8)
end

(making a more efficient (int * int) Array.t is left as an exercise to the reader)
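
One possible sketch of that exercise, packing each pair as two little-endian 64-bit ints in a Bytes.t:

module Point_array : sig
  type t
  val create : int -> t
  val length : t -> int
  val get : t -> int -> int * int
  val set : t -> int -> int * int -> unit
end = struct
  type t = Bytes.t
  (* 16 bytes per (int * int) pair *)
  let create len = Bytes.make (16 * len) '\000'
  let length t = Bytes.length t / 16
  let get t i =
    ( Int64.to_int (Bytes.get_int64_le t (16 * i)),
      Int64.to_int (Bytes.get_int64_le t (16 * i + 8)) )
  let set t i (x, y) =
    Bytes.set_int64_le t (16 * i) (Int64.of_int x);
    Bytes.set_int64_le t (16 * i + 8) (Int64.of_int y)
end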

The downside compared to bigarrays is that it doesn’t support sub without copying. Also, bytes can be moved by the GC (during minor GCs or compaction), and therefore you cannot release the runtime lock when passing them to C. The latter point is less relevant with the multicore extensions, especially since there is no compactor yet. There is some related discussion on the eio repository: Decide on cstruct vs bytes · Issue #140 · ocaml-multicore/eio · GitHub


Is there a way we can place this array in shared memory?
If so, this could be used for IPC:

  • the writer process creates and populates the carray
  • forked processes can read from it.

The documentation link leads to your GitLab, which requires a login to view the page. Perhaps there is a way so that random readers don't need to create an account?

Imagine a reading process that copies a value from the shared memory by referring to it by pointer (a common thing in OCaml). It now has access to a value that can be erased outside of its control (since the GC runs in the writer process). The only reason Bigarrays can work in this way is that primitive values can only be copied into the OCaml processes that access them.

Thanks for the report. Should be fixed with Fix doc publishing (32f4443c) · Commits · Danny Willems / ocaml-carray · GitLab. The link should be accessible in a few minutes.
Note: nothing exciting in the doc yet. It is simply a copy of Array.mli from ocaml/ocaml.

I don’t know much about shared memory and IPC, but I can have a look later. It looks like the memory must be allocated with a primitive other than malloc. At the moment, the library simply works by allocating (Caml) out-of-heap values using malloc, and a value of type 'a Carray.t is a pointer to this out-of-heap memory.
EDIT: how do we deal with the GC?

I haven’t compared the performance to Bigarray or other implementations (there is one in Ctypes, but I never gave it a shot). I remember Ctypes adds a lot of information in the type (to be very generic), which can induce performance issues.
Also, AFAIK, Bigarray only works for integers and floats. Introducing int and float in Carray won’t be that difficult. I am thinking about adding a small tag in the type 'a Carray.t to distinguish the underlying type and decide which runtime _val macro to use; I will give a better answer later. I wouldn’t be surprised if the Bigarray implementation uses this idea or a more complex one. Nativeint, Int32 and Int64 can already be used since, up to now (OCaml 4.14), they are implemented as custom blocks. However, while adding tests for them (see Files · add-int32-int64-nativeint · Danny Willems / ocaml-carray · GitLab), I am getting out-of-memory errors. I will have a look later.
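
To illustrate the tag idea (purely hypothetical, nothing below exists in carray 0.0.1):

(* Hypothetical sketch only: a small kind tag stored next to the out-of-heap
   pointer so the C stubs know which _val macro / element size to use. *)
type kind = Custom | Int | Float | Int32 | Int64 | Nativeint

type 'a t = {
  data : nativeint;  (* stand-in for the raw pointer to the malloc'd memory *)
  kind : kind;
  length : int;
}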

Another note about performance: I realised that using Carray.iter gives worse performance (by 40%) than using a Caml array (see the examples directory). I suppose there is an optimisation in the Caml compiler when only Caml code is used. Also, the library performs an allocation for each element. I haven't had time to think about a better implementation.

Carray was initially built to store custom blocks and to use the structure in C bindings. If the idea can be implemented in other existing libraries that support other types with the same performance, I am happy to merge the project into them to avoid duplication.


using bytes for perf is an old but gold trick in microbenchmark drag races i had with friends lol


I was thinking about a way for processes to communicate.
I was not looking for a way to synchronize processes.
And yes, when you communicate via shared memory, your processes had better know what they are doing and when to do it.

What I tried to explain is that it’s fundamentally unsafe.

Would you advise me to use your library to store an array of Vector3.t points?

 module Vector3 = struct type t = { x : float; y : float; z : float } end

Or, is it stupid because an array of such things would already be optimized by the compiler to something more efficient (because in the end it is just an array of only floats)?

Regards,
F.

you might be missing unsafe_get / unsafe_set for people who really know what they are doing. :wink: