Some SIMD in your OCaml

Fresh from a weekend of hacking, I would like to share some results of an experiment I conducted of creating a library for exposing Intel AVX2 intrinsics to OCaml code. AVX2 is an instruction set subset that adds data-parallel operations in hardware.

I chose to fork the amazing bigstringaf library and modified it. You can find the additions to the code here - bigstringaf_simd.

Overview

Given a type Bigstring.t (1 dimensional byte arrays) there now exist functions such as -

val cmpeq_i8 : (t * int) -> (t * int) -> (t * int) -> unit

So cmpeq_i8 (x,o1) (y,o2) (z,03) will compare 32 bytes starting at o1 and o2 from x and y respectively and store the result in z at o3.

Why?

This was mainly an exercise in curiosity. I just wanted to learn whether something like this is viable. I also want to see if adding some type-directed magic + ppx spells can let us write data parallel code much more naturally similar to what lwt / async did for async code.

At the same time, you might ask - why not use something like Owl (which already has good support for data-parallel operations)? Apart from the fact that such libraries are oriented towards numerical code, I would also like to explore if we can operate directly on OCaml types and cast them into data parallel algorithms. Like how simdjson pushed the boundaries of JSON parsing, it would be nice to port idiomatic code to data-parallel versions in OCaml. Can we, at some point, have generic traversals of data-types, which are actually carried out in a data-parallel fashion?

Does it work?

Given the limitation of the current implementation (due to foreign function calls into C), I still found some preliminary results to be interesting! Implementing the String.index function, which returns the first occurence of a char, the runtime for finding an element at the n-1 position in an array with 320000000 elements is -

serial: 1.12 seconds
simd: 0.72 seconds (1.5x)

I still have to do the analysis what the overhead of the function call into C is (even with [@@noalloc]!

Future directions

It would be interesting to see, if we can create a representation which encapsulates the various SIMD ISA’s such as AVX2, AVX512, NEON, SVE etc. Further more, it would be really interesting to see if we can use ppx to automatically widen map functions to operate on blocks of code, or automatically cast data types in a data parallel representation.

Disclaimer

This was mostly a hobby project, so I cannot promise completing any milestones or taking feature requests etc. I definitely do not recommend using this in production, because of the lack of testing etc.

19 Likes

Quite interesting project.

Thank you! Very early days, hopefully I can get it to a useful form soon!

Do you have any kind of vision about which class of programs may benefit for fancy amd64 instructions? I mentioned autovectorization topic in a chatroom somewhere on ICFP2020. Folks from Janestreet said that it’s unlikely that they are writing code that could benefit from this kind of instructions.

1 Like

Isn’t it supposed to be useful even for simple memcpy or string search? I don’t know if it’s the fancy newer instructions, but OCaml could certainly use faster copy/blit/index on bytes, too.

1 Like

@Kakadu Though their are quite a few fancy instruction set subsets in amd64, I think SIMD instruction sets like AVX / AVX2 / AVX512 are becoming more common now. If you look at DPDK, which does userspace network packet manipulation, simdjson which does JSON parsing and Hyperscan for regex matching, all of them benefit from SIMD.

The problem I feel here is that SIMD has always been targeted towards math-heavy code, but I do feel there is some potential for defining generic maps / traversals etc over OCaml ADT’s that can benefit from SIMD. That would possibly open-up any code benefiting from SIMD (but this is a pipe-dream!)

At the same time, I think if you are optimizing for latency (which in my WILD GUESS, Jane Street is probably interested in), then i don’t think SIMD helps much there. Wherever throughput is a concern, SIMD would be more helpful.

@c-cube For the case of copy / blit, I think it would be hard to beat the system memcpy / memmove. There is an example here FastMemcpy, which uses AVX to beat memcpy/memmove but I don’t think the effort is worth (maybe?). At the same time, the index function is quite interesting, because by using a vectorized version which goes through a c call, I am able to get a 1.5x speedup. Would be interesting to see if the performance benefit if that instruction can be emitted inline.

Yes, even calling the C memcpy for bytes functions would often result in a performance improvement, imho. I’m not sure if the compiler maintainers are willing to add these instructions to the code generator, or any instruction that goes beyond x86_64.

(I’d love to have popcnt otherwise…)

The bigstringaf library already implements blits / copies as memcpy / memmove!

1 Like

Indeed, but bytes could also use some love!

2 Likes

I’m a bit puzzled by your comment here. As far as I can tell, Bytes.blit actually calls memmove currently, cf https://github.com/ocaml/ocaml/blob/trunk/stdlib/bytes.ml#L36 and https://github.com/ocaml/ocaml/blob/trunk/runtime/str.c#L363 (the reason for using memmove and not memcpy is to correctly handle overlapping source and destinations iirc).

I’m a bit puzzled by your comment here. As far as I can tell, Bytes.blit actually calls memmove currently, cf https://github.com/ocaml/ocaml/blob/trunk/stdlib/bytes.ml#L36 and https://github.com/ocaml/ocaml/blob/trunk/runtime/str.c#L363 (the reason for using memmove and not memcpy is to correctly handle overlapping source and destinations iirc).

I stand corrected :slight_smile:

In practice memmove and memcpy are the same at least in glibc, as far as
I know. It’s good to know! My point still stands for index, but it’s
weaker now.

1 Like

I think the reality is that outside of numeric computation, it’s pretty tough to use SIMD in a functional language. Functional languages just don’t match up with SIMD data patterns, which are array-based. A blob of string data is not a useful data type for OCaml – it immediately needs to be transformed to a more convenient data structure. On the other hand, a blob of numeric data is often well structured with predefined data lengths, and so can be carried around or transformed with computation occurring as map iterations over the data, which is what Owl does.

2 Likes

I could imagine a specific array library, or even the Bigarray library (or a specific version of it), exposing some functions to exploit SIMD instructions.
This being said, I’m going to test this to see if GPUs are of any use to me:
https://futhark-lang.org/

This is no longer true since 2010. (At the time, this caused some severe bugs all over the place, when people suddenly discovered they could no longer incorrectly use memcpy to perform overlapping moves.) More precisely, here is the definition of memmove. This looks identical to memcpy, except that the macro USE_AS_MEMMOVE is set, which adds some checks for overlapping. (And this macro is unset for memcpy).

#define USE_AS_MEMMOVE
#define MEMCPY                __memmove_ssse3
#define MEMCPY_CHK        __memmove_chk_ssse3
#include "memcpy-ssse3.S"

Modern C and C++ compilers (like gcc and llvm) make heavy use of SIMD instructions on architectures that provide them because they produce serious speedups. Perhaps the OCaml compiler could do the same.

2 Likes

I don’t think this is necessarily true. There’s this neat paper from a few years ago that extended a stream type with simd support and got a ~2x performance improvement. The interface is still nicely functional; all the simd details are hidden.

2 Likes

Agreed @jfeser, I don’t see any reason why FP patterns cannot be mapped to SIMD programming. At the end of the day everything can be hidden as an abstract datatype! :stuck_out_tongue: Thanks for the paper suggestion!

@perry, not a big fan of compilers rolling SIMD / vector compilers. Too much complexity to effectively manage and it is definitely not a common enough application to warrant support for everyone. A library approach to SIMD is much more tenable, in my opinion.

All modern processors do SIMD, and somehow, people using languages like C++ manage to write compilers that handle the complexity. The result is that they get enormous performance wins because single thread performance has stalled out on modern processors and explicit parallelism is how people now get speed increases.

If you play with the output of compilers like gcc and clang/LLVM, you will see that vector instructions are inserted into pretty much every sort of program imaginable. This is because it substantially improves performance, even in everyday code. You cannot get this sort of benefit from using SIMD instructions in libraries in very limited contexts.

1 Like

This is the exception that proves the rule IMO. Note the data type they use: Vector. At some point, data has to occupy adjacent memory cells to be viable for SIMD operations. This is not typical of FP data structures at all.

In a language with monitored side effects (such as Haskell), stream fusion allows several map/fold operations to be strung together, much like we do for the Iter or Seq type in OCaml. You can then unroll a few of these loops to use up the SIMD registers. But notice that this isn’t very different from simply using tensor data types, and manipulating them with operations that adjust the whole tensor. In both cases, the main target is numerical data, and in both cases, you end up manipulating batches of data. In fact, I would suggest that the latter representation is generally superior in this case, since the user is aware that he is manipulating batches of data and therefore thinking at a higher conceptual level. This paper happens to have been written before the explosion of numpy (and its eventual arrival to OCaml in the form of owl), and therefore still discusses at the level of manipulating individual numbers, when the world has moved on to thinking about large groups of numbers and their manipulation (for machine learning and other mathematical applications). Again, note that this is a very specific niche – purely numeric data, organized together in large, regular batches.

1 Like

Nowadays, instruction sets provide gather-load and scatter-store instructions, so they no longer require adjacent memory cells (since 2013 for x86, since 2016 for ARM).

Theoretically, gather-scatter instructions are meant to load/store disjoint elements from a single array. But nobody prevents you from having the array start at address zero and cover the whole address space, so that the indices are actual memory addresses.

But to be honest, compilers are just bad at recognizing gather-scatter operations when vectorizing, so one has to manually write such accesses using intrinsics in practice.

1 Like