Vectorized operations for float arrays or bigarrays of floats

Has anyone ever implemented vectorized
operations for float arrays or Bigarray.Array1 of floats?
I guess this would go through some C code calling specialised “multimedia” (SIMD) CPU instructions.

In some simulation code, translating a molecule just means adding
a small constant to all X values, another constant to all Y values, and a third to all Z values
(if we have a struct-of-arrays architecture).
So, at some point, I might be interested in such optimizations.
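For concreteness, the scalar version of what I would like to vectorize looks roughly like this (just a sketch, assuming the X values live in a Bigarray.Array1 of float64):

    (* Sketch: shift all X values by dx; Y and Z would get their own constants. *)
    let translate_x dx (xs : (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t) =
      for i = 0 to Bigarray.Array1.dim xs - 1 do
        Bigarray.Array1.unsafe_set xs i (Bigarray.Array1.unsafe_get xs i +. dx)
      done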

If you have another suggestion (like using the OCaml bindings to BLAS/LAPACK or I don’t know what yet), just let me know.

Thanks,
F.

3 Likes

Translating all atoms at once sounds like it could benefit from an array-of-structs layout and then calling daxpy from BLAS via lacaml? One probably needs to weigh the benefit against the potentially more cumbersome access to individual atoms.
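For example, shifting a contiguous array of X coordinates by dx could look roughly like this (a hedged, untested sketch; lacaml’s axpy signature has changed across versions, so check the one you have installed):

    (* Untested sketch: axpy computes y := alpha * x + y, so with x a
       vector of ones every X coordinate gets shifted by dx. *)
    open Lacaml.D

    let translate_x dx (xs : vec) =
      let ones = Vec.make (Vec.dim xs) 1.0 in
      axpy ~alpha:dx ones xs

With an interleaved array-of-structs layout, the BLAS stride arguments of daxpy would be used to pick out the X components instead.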

At LexiFi we use one-dimensional bigarrays for most of our numerical code. As far as vectorization is concerned, we mostly use two techniques:

  • We have many small C primitives working on arrays of doubles, e.g.

    void mlfi_mul(int size, double *src1, double *src2, double *dst) {
        for (int k = 0; k < size; k++) {
            dst[k] = src1[k] * src2[k];
        }
    }
    

    and then rely on GCC’s auto-vectorizer (enabled when compiling with -O3 or by passing -ftree-vectorize). But note that some vectorizations require specific processor extensions and may fail at runtime if the processor does not support the necessary instructions. At LexiFi we actually compile the same code for several instruction sets and do a runtime check to decide which code to use. But depending on your needs, you may not have to go through this extra complication and can just assume that your target processor implements the necessary instructions. (A sketch of the OCaml-side declaration for such a primitive follows after this list.)

  • We also wrote bindings for the MKL library, which insulate you from the different instruction sets; my impression (but I am not an expert) is that it brings some more advanced optimizations than what you can get naïvely from GCC.
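For reference, the OCaml side of such a primitive can be declared along these lines (an illustrative sketch only, not our actual code; the hypothetical "mlfi_mul_stub" wrapper would fetch the data pointers with Caml_ba_data_val and call mlfi_mul):

    (* Illustrative sketch: binding a C primitive like the one above on
       float64 bigarrays. "mlfi_mul_stub" is a hypothetical C wrapper. *)
    type vec = (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t

    external mul : int -> vec -> vec -> vec -> unit = "mlfi_mul_stub" [@@noalloc]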

Cheers,
Nicolas

4 Likes

I’ve written a vectorized bitvector library (GitHub - jfeser/bitarray: Fast vectorized bitarrays for OCaml). It uses ISPC to implement the kernels (although gcc does a great job with simple ones). ISPC deals with the vector instructions, and it can emit code that selects the right kernel at runtime, depending on which instructions are available. Doing something similar for float arrays should be straightforward.

All that said, for the uses you describe, BLAS sounds like the right choice.

1 Like

This sounds like exactly what Owl is for, doesn’t it? It is built on BLAS/LAPACK.
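For the translation example, something along these lines should work (a hedged sketch, assuming Owl’s Arr.add_scalar is available in your version):

    (* Hedged sketch (untested): shift all X coordinates by dx with Owl;
       add_scalar returns a new array rather than mutating in place. *)
    let translate_x dx xs = Owl.Arr.add_scalar xs dx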

2 Likes

So, even LexiFi people use k++ when they mean ++k ? :wink:

1 Like

Did you release a library exposing such bigarray-accelerated operations?
I would be happy to give it a try.

Antoine Mine suggested that I use C code working on bigarrays, as you do, with gcc options
such as -O3, or even -O3 -mtune=native -march=native.
This sounds like a very simple and pragmatic suggestion.

1 Like

Owl is a rather large dependency. I know we have bindings to BLAS/LAPACK thanks to Markus Mottl; I could use those directly and not require Owl.

MKL is not free, right?

Yes, it is free (including for commercial use), but not open-source.

Cheers,
Nicolas

Owl_base is built exactly for this purpose.

1 Like

I use lacaml, which is easy to use and install, with very few dependencies.
Lacaml can use MKL by setting environment variables when compiling. This worked 2 years ago (but I don’t know if it still works):

export LACAML_LIBS="-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_rt -lpthread -lm -ldl"
opam install lacaml

3 Likes

I had shared some AVX2 intrinsics directly exposed in the library here - bigstringaf/bigstringaf_simd.mli at master · anmolsahoo25/bigstringaf · GitHub

I haven’t used the library in a while, and I think it was mentioned that since the function signatures use tuples, they would lead to unnecessary allocations. I will probably revisit the library sometime in the future.

1 Like