Issues using OpenMP in Owl

Recently I have been looking at using OpenMP in the numerical library Owl. To improve efficiency, some of the core functions in Owl are implemented in C. For example, to apply the sin function to a vector:

open Owl
module N = Dense.Ndarray.S

let _ =
  let n = 1000000 in
  let x = N.ones [|n|] in
  N.sin x

In the end this calls the C code in owl_ndarray_maths_map.h, where the sin function is applied to each element of the array in a for-loop. OpenMP can be used here to speed up the for-loop by distributing the sin operations across multiple threads, as shown in owl_ndarray_maths_map_omp.h. So ideally, more threads should lead to a shorter running time.
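As a rough illustration of the pattern described above (this is not Owl's actual code, which lives in the two headers mentioned), an OpenMP element-wise map could be sketched as follows. The names map_f32 and square_f32 are hypothetical; square_f32 is only a stand-in so the sketch builds without libm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element-wise kernel: y[i] = op(x[i]) for i in [0, n).
   With -fopenmp the iterations are split across threads; without it
   the pragma is ignored and the loop runs serially. */
static void map_f32(const float *x, float *y, long n, float (*op)(float))
{
    assert(x != NULL && y != NULL && op != NULL);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        y[i] = op(x[i]);
}

/* Stand-in operation (assumption): squares its argument. */
static float square_f32(float v)
{
    return v * v;
}
```

Passing sinf from <math.h> as op (and linking with -lm) would give the sin case discussed here.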

However, on my laptop it shows the opposite behaviour: when I enable the OpenMP switch while compiling Owl and run the OCaml code above, the running time of N.sin x actually grows with the number of OpenMP threads, as shown here:

[Figure: Running sin function on vector in Owl]

The result suggests that using more threads only adds a fixed amount of overhead, without any work actually being done in parallel. To check whether this problem is caused by the OpenMP installation on my machine itself, I wrote a short C program to do the same computation:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

static long N = 10000000;
#define NUM_THREADS 4

int main(void)
{
    /* omp_get_wtime() returns a double, so keep the timestamps as doubles */
    double start_time, run_time;
    float *x = (float *) calloc(N, sizeof(float));
    float *y = (float *) calloc(N, sizeof(float));
    if (x == NULL || y == NULL) exit(1);

    /* Fill the input with 0, 1, 2, ... */
    for (long i = 0; i < N; ++i)
        x[i] = (float) i;

    start_time = omp_get_wtime();

    omp_set_num_threads(NUM_THREADS);
    /* Apply sinf to every element, splitting the iterations across threads */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i) {
        float tmp = x[i];
        y[i] = sinf(tmp);
    }
    run_time = omp_get_wtime() - start_time;
    printf("\nRunning Time is %f seconds\n", run_time);

    free(x);
    free(y);
    return 0;
}

Compiled with gcc -fopenmp sin.c -lm on the same machine, this C program does show the expected behaviour: more threads give a shorter running time. Using ldd, I checked that both the OCaml native executable and the C executable are linked against the same libgomp.so file.
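Beyond the ldd check, one could also confirm that a parallel region really receives the requested number of threads in both executables. The helper below is a minimal sketch, assuming it can be called from the C program and from a C stub linked into the OCaml executable; the name team_size is hypothetical.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of threads that actually run a parallel
   region after asking for `requested`. Returns 1 when the file is
   built without -fopenmp. */
static int team_size(int requested)
{
    assert(requested > 0);
    int used = 1;
#ifdef _OPENMP
    omp_set_num_threads(requested);
    #pragma omp parallel
    {
        /* One thread records the team size; the implicit barrier at
           the end of `single` makes the value visible afterwards. */
        #pragma omp single
        used = omp_get_num_threads();
    }
#endif
    return used;
}
```

The runtime may grant fewer threads than requested, so a result of 1 in the OCaml build and 4 in the C build would point at the build or runtime setup rather than at the loop itself.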

Also, this issue can be reproduced on another Ubuntu machine of mine, but not on Mac machines.
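One measurement detail may be worth ruling out, since the timing method on the OCaml side is not stated here. C's clock() reports processor time, which on Linux is summed over all threads of the process, and OCaml's Sys.time is likewise documented as processor time. Such a clock can grow with the thread count even when the wall-clock time shrinks. The sketch below times the same work with both kinds of clock; all names (time_both, busy_sum, sink) are hypothetical, and busy_sum is only a stand-in workload.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

struct timing { double wall; double cpu; };

/* Keeps the result observable so the loop is not optimised away. */
static volatile double sink;

/* Wall-clock seconds: omp_get_wtime() under OpenMP; otherwise time(),
   which only has whole-second resolution. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double) time(NULL);
#endif
}

/* Processor time used by the whole process so far. */
static double cpu_seconds(void)
{
    return (double) clock() / CLOCKS_PER_SEC;
}

/* Stand-in workload: sums 0..n-1, in parallel when OpenMP is enabled. */
static void busy_sum(void *arg)
{
    long n = *(const long *) arg;
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (long i = 0; i < n; ++i)
        acc += (double) i;
    sink = acc;
}

/* Runs work(arg) once and reports elapsed wall and CPU time. */
static struct timing time_both(void (*work)(void *), void *arg)
{
    assert(work != NULL);
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    work(arg);
    struct timing t = { wall_seconds() - w0, cpu_seconds() - c0 };
    return t;
}
```

If the cpu figure rises with the number of threads while the wall figure falls, the loop is running in parallel and only the choice of clock differs.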

When compiling Owl, I removed all the C flags to avoid any possible interference. It is not very likely, but I wonder whether jbuilder/dune adds some default C flags when compiling the Owl code. @jeremiedimino

I’m definitely not an expert in OpenMP, and maybe this is not the best place to discuss C in much detail, but if anyone has experience of using OpenMP in OCaml/C hybrid systems like Owl @ryanrhymes, your help would be more than welcome. Thank you!