Fresh from a weekend of hacking, I would like to share some results of an experiment: a library that exposes Intel AVX2 intrinsics to OCaml code. AVX2 is an x86 instruction set extension that adds 256-bit data-parallel (SIMD) operations in hardware.
I chose to fork the amazing bigstringaf library and modify it. You can find the additions to the code here: bigstringaf_simd.
Overview
Given a type Bigstring.t (one-dimensional byte arrays), there now exist functions such as:
val cmpeq_i8 : (t * int) -> (t * int) -> (t * int) -> unit
So cmpeq_i8 (x,o1) (y,o2) (z,o3) will compare 32 bytes starting at o1 and o2 from x and y respectively, and store the result in z at o3.
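To make the semantics concrete, here is a minimal pure-OCaml reference model of that call. It is a sketch, not the library's implementation: the name cmpeq_i8_ref and the helper of_string are made up for illustration, and the result encoding (0xff for equal bytes, 0x00 otherwise) is an assumption based on how the underlying AVX2 byte-compare instruction behaves.

```ocaml
(* Same underlying representation as a bigstring: a 1-D char Bigarray. *)
type t =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Width of one AVX2 register in bytes (256 bits). *)
let block = 32

(* Hypothetical scalar stand-in for cmpeq_i8. Assumption: like the AVX2
   byte compare, it writes '\xff' where bytes match, '\x00' where not. *)
let cmpeq_i8_ref ((x, o1) : t * int) ((y, o2) : t * int)
    ((z, o3) : t * int) : unit =
  for i = 0 to block - 1 do
    let eq =
      Bigarray.Array1.get x (o1 + i) = Bigarray.Array1.get y (o2 + i)
    in
    Bigarray.Array1.set z (o3 + i) (if eq then '\xff' else '\x00')
  done

(* Illustration-only helper: copy a string into a fresh buffer. *)
let of_string (s : string) : t =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b
```

A model like this can double as a test oracle: run the real function and the scalar one on the same inputs and compare the 32 output bytes.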
Why?
This was mainly an exercise in curiosity: I wanted to learn whether something like this is viable. I also want to see whether some type-directed magic plus ppx spells can let us write data-parallel code much more naturally, similar to what lwt / async did for async code.
At the same time, you might ask: why not use something like Owl (which already has good support for data-parallel operations)? Apart from the fact that such libraries are oriented towards numerical code, I would also like to explore whether we can operate directly on OCaml types and cast them into data-parallel algorithms. Just as simdjson pushed the boundaries of JSON parsing, it would be nice to port idiomatic code to data-parallel versions in OCaml. Can we, at some point, have generic traversals of data types that are actually carried out in a data-parallel fashion?
Does it work?
Despite the limitations of the current implementation (due to foreign function calls into C), I found some preliminary results interesting! I implemented the String.index function, which returns the index of the first occurrence of a char. The runtime for finding an element at position n-1 in an array of 320000000 elements is:
serial: 1.12 seconds
simd: 0.72 seconds (1.5x)
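For readers wondering what a blocked String.index might look like, here is a rough sketch of the general shape. It is not the code that produced the timings above: match_mask is a hypothetical scalar stand-in for the vector step (with AVX2 this would typically be a broadcast, a byte-wise compare and a movemask), and unlike String.index this version returns an option instead of raising Not_found.

```ocaml
type t =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Bytes handled per step, matching one 256-bit register. *)
let block = 32

(* Hypothetical stand-in for one vector step: bit i of the result is set
   when byte (off + i) equals c. Assumes 63-bit ints (64-bit platform). *)
let match_mask (b : t) (off : int) (c : char) : int =
  let m = ref 0 in
  for i = 0 to block - 1 do
    if Bigarray.Array1.get b (off + i) = c then m := !m lor (1 lsl i)
  done;
  !m

(* Position of the lowest set bit; m must be non-zero. *)
let lowest_bit (m : int) : int =
  let rec go i = if m land (1 lsl i) <> 0 then i else go (i + 1) in
  go 0

(* Scan whole blocks first, then finish the remainder byte by byte. *)
let index (b : t) (c : char) : int option =
  let len = Bigarray.Array1.dim b in
  let rec tail i =
    if i >= len then None
    else if Bigarray.Array1.get b i = c then Some i
    else tail (i + 1)
  in
  let rec blocks off =
    if off + block > len then tail off
    else
      let m = match_mask b off c in
      if m <> 0 then Some (off + lowest_bit m) else blocks (off + block)
  in
  blocks 0
```

In a real SIMD version only match_mask would cross into C, so the loop makes one foreign call per 32 bytes rather than one per byte.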
I still have to analyse what the overhead of the function call into C is (even with [@@noalloc])!
Future directions
It would be interesting to see if we can create a representation that encapsulates the various SIMD ISAs such as AVX2, AVX512, NEON, SVE etc. Furthermore, it would be really interesting to see if we can use ppx to automatically widen map functions to operate on blocks of code, or automatically cast data types into a data-parallel representation.
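One possible shape for such a representation, purely as a thought experiment, is a module signature with one implementation per instruction set. Everything below is hypothetical: the SIMD signature, the Scalar fallback and the Avx2_like / Neon_like modules are illustrative names, and the scalar code only emulates the register width rather than using any real intrinsics.

```ocaml
type buf =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical interface: one implementation per instruction set. *)
module type SIMD = sig
  val width : int (* bytes per vector register *)
  val cmpeq_i8 : buf * int -> buf * int -> buf * int -> unit
end

(* Portable scalar fallback, parameterised by the width it emulates. *)
module Scalar (W : sig val width : int end) : SIMD = struct
  let width = W.width

  let cmpeq_i8 (x, o1) (y, o2) (z, o3) =
    for i = 0 to width - 1 do
      let eq =
        Bigarray.Array1.get x (o1 + i) = Bigarray.Array1.get y (o2 + i)
      in
      Bigarray.Array1.set z (o3 + i) (if eq then '\xff' else '\x00')
    done
end

(* Stand-ins only: same scalar code, different emulated widths. *)
module Avx2_like = Scalar (struct let width = 32 end)
module Neon_like = Scalar (struct let width = 16 end)

(* Code written against SIMD does not depend on the register width. *)
module Ops (S : SIMD) = struct
  (* true when the first S.width bytes of a and b are identical *)
  let block_equal (a : buf) (b : buf) : bool =
    let z = Bigarray.Array1.create Bigarray.char Bigarray.c_layout S.width in
    S.cmpeq_i8 (a, 0) (b, 0) (z, 0);
    let ok = ref true in
    for i = 0 to S.width - 1 do
      if Bigarray.Array1.get z i <> '\xff' then ok := false
    done;
    !ok
end
```

A fixed width value fits AVX2 or NEON; a scalable ISA such as SVE would probably need the width to be discovered at runtime, which this sketch does not attempt.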
Disclaimer
This was mostly a hobby project, so I cannot promise to complete any milestones or take feature requests. I definitely do not recommend using this in production, given the lack of testing.