Has anyone ever implemented vectorized
operations for float arrays or Bigarray.Array1 of floats?
I guess via some C code calling specialised “multimedia” (SIMD) CPU instructions.
In some simulation code, translating a molecule is just adding
a small constant to all X values, another constant to all Y values, and another constant to all Z values
(if we have a struct-of-arrays architecture).
So, at some point, I might be interested in such optimizations.
If you have another suggestion (like using the OCaml bindings to BLAS/LAPACK, or something I don’t know about yet), just let me know.
Translating all atoms at once sounds like it could benefit from array-of-structs and then calling daxpy from BLAS via lacaml? One probably needs to weigh the benefit against the potentially more cumbersome access to atoms.
At LexiFi we use 1-dimension bigarrays for most of our numerical code. As far as vectorization is concerned, we mostly use two techniques:
We have many small C primitives working on arrays of doubles, e.g.
```c
void mlfi_mul(int size, double *src1, double *src2, double *dst)
{
  for (int k = 0; k < size; k++) {
    dst[k] = src1[k] * src2[k];
  }
}
```
and then rely on GCC’s auto-vectorizer (enabled when compiling with -O3, or by passing -ftree-vectorize). Note, however, that the vectorized code may use specific processor extensions (e.g. AVX) and will fail at runtime, with an illegal-instruction error, if the processor does not support the necessary instructions. At LexiFi we actually compile the same code for several instruction sets and do a runtime check to decide which code to use. But depending on your needs, you may be able to skip this extra complication and just assume that your target processor implements the necessary instructions.
We also wrote bindings for the MKL library, which insulates you from the different instruction sets; and my impression (but I am not an expert) is that it brings some more advanced optimizations than what you can get naïvely from GCC.
I’ve written a vectorized bitvector library (GitHub - jfeser/bitarray: Fast vectorized bitarrays for OCaml). It uses ISPC to implement the kernels (although gcc does a great job with simple kernels). ISPC deals with the vector instructions, and it can emit code that selects the right kernel at runtime, depending on which instructions are available. Doing something similar with float arrays should be straightforward.
All that said, for the uses you describe, BLAS sounds like the right choice.
Antoine Mine suggested that I use C code working on bigarrays, as you do, with gcc options
such as -O3, or even -O3 -mtune=native -march=native.
This sounds like a very simple and pragmatic suggestion.
I use lacaml, which is easy to use and install, with very few dependencies.
Lacaml can use MKL by setting environment variables when compiling. This worked 2 years ago (but I don’t know if it still works).
I haven’t used the library in a while, and I think it was mentioned that, since the function signatures use tuples, it would lead to unnecessary allocations. I will probably revisit the library sometime in the future.