If I am right, we have sin and cos but no function that combines the two.
I think some hardware has a special instruction to compute both at once (x87 has fsincos, for example),
because you often need both the sine and the cosine of an angle.

Having said that, even a software implementation of sincos (if available) might be faster than calling sin and cos separately.
YMMV; I’d suggest taking some measurements (the results may also differ depending on which libm implementation is used and which architecture it is compiled for).

You could probably get decent speedups if you need to compute lots of them and can use some sort of vectorized / data-parallel interface. Something like:

val sincos : radians_in:float array
-> sin_out:float array
-> cos_out:float array
-> unit

This should allow an implementation that could potentially take advantage of SIMD instructions and a good level of instruction-level parallelism (and even exploit multiple cores).
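
For reference, a naive implementation of the interface above could look like the sketch below. It is only an assumption about how the signature might be filled in: it calls sin and cos separately in a scalar loop and makes no attempt at SIMD, so it mainly serves as a baseline to measure a tuned version against.

```ocaml
(* Hypothetical baseline for the array-based interface sketched above.
   It fills both output arrays in one pass; the loop body is where a
   combined or vectorized sincos would go. *)
let sincos ~(radians_in : float array) ~(sin_out : float array)
    ~(cos_out : float array) : unit =
  let n = Array.length radians_in in
  if Array.length sin_out <> n || Array.length cos_out <> n then
    invalid_arg "sincos: all three arrays must have the same length";
  for i = 0 to n - 1 do
    let x = radians_in.(i) in
    sin_out.(i) <- sin x;
    cos_out.(i) <- cos x
  done
```

The caller preallocates sin_out and cos_out, so repeated calls can reuse the same buffers.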

Of course, even if the sincos computation itself were sped up, a plain array-based interface like this may be cumbersome, and it may be difficult to actually realize any performance benefit. What you’d really want is a way to compose data-parallel computations that minimizes the number of intermediate data structures and allows the compiler to generate fused inner loops. Stream Fusion is an old paper on the topic. Here is a newer one: Stream Fusion, to Completeness.

The main trick is that the combinators do not have closed loops. I’m not sure how well the OCaml compiler inlines loopless higher-order functions. MLton used to be pretty good at it. My toy Fωμ compiler can also do it. Hopefully Flambda 2 will open up these sorts of things for OCaml.
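
To illustrate what "no closed loops" means, here is a minimal sketch of a pull-style stream in OCaml. The names (stream, of_array, map, fold) are made up for this example, and the step function returns an option rather than the richer step type used in the Stream Fusion paper, so treat it as a simplification and not as the paper's design. Only fold contains a loop; map merely wraps the step function, which is what would let a sufficiently aggressive inliner collapse a pipeline into a single loop.

```ocaml
(* A stream is a step function plus a hidden state of some type 's. *)
type 'a stream = Stream : ('s -> ('a * 's) option) * 's -> 'a stream

(* Producer: the state is the current index. No loop here. *)
let of_array (a : 'a array) : 'a stream =
  Stream ((fun i -> if i < Array.length a then Some (a.(i), i + 1) else None), 0)

(* Transformer: wraps the step function. No loop here either. *)
let map f s =
  match s with
  | Stream (next, s0) ->
    Stream
      ( (fun st ->
          match next st with
          | None -> None
          | Some (x, st') -> Some (f x, st')),
        s0 )

(* Consumer: the only place where a loop is actually run. *)
let fold f init s =
  match s with
  | Stream (next, s0) ->
    let rec go acc st =
      match next st with
      | None -> acc
      | Some (x, st') -> go (f acc x) st'
    in
    go init s0
```

For example, fold (+.) 0.0 (map sin (of_array xs)) sums the sines of xs without building an intermediate array. Whether the closures and option values are actually eliminated depends on the compiler.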

That is the point. Most modern software implementations of sine and cosine rely on the following trigonometric identities: sin (a + b) = sin a * cos b + cos a * sin b and cos (a + b) = cos a * cos b - sin a * sin b. So, if your mathematical library has computed one (meaning it has already computed the four values cos a, sin a, cos b, and sin b), it is only two multiplications and one addition away from computing the other.
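
The two identities can be checked numerically with a few lines of OCaml. The function name sincos_of_sum is hypothetical and this is not how any particular libm is written; it only shows that, once the four values are known, each result costs two multiplications plus one addition or subtraction.

```ocaml
(* Given sin and cos of a and b, return (sin (a + b), cos (a + b))
   using the angle-addition identities. *)
let sincos_of_sum ~sin_a ~cos_a ~sin_b ~cos_b =
  let s = (sin_a *. cos_b) +. (cos_a *. sin_b) in
  let c = (cos_a *. cos_b) -. (sin_a *. sin_b) in
  (s, c)
```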

That is why both GCC and Clang optimize the following code into a single call to sincos(x) followed by an addition. (For Clang, you might need to pass -ffast-math.)

If the function is not available, then an #ifdef could choose to call sin and cos sequentially instead. To make ‘noalloc’ direct C calls from OCaml possible (without intermediate stubs), perhaps there could be an ‘ml_sincos’ function that is defined either as ‘sincos’ itself or as the above fallback implementation. This could be prototyped as a small library on ‘opam’ initially, and its performance benefits and usefulness measured.
(Although ‘sincos’ needs to return 2 values, so it is not immediately obvious to me whether a ‘noalloc’ implementation would be possible here, unless you’ve preallocated an array or bigarray to store the results in.)

If you have any numerical code dealing with rotations, then as soon
as you have an angle, you are usually interested in both its sine and its cosine.
This is useful for computer graphics, but also for molecular simulation software.
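
A 2D rotation shows why: every rotated point needs both values for the same angle. The sketch below uses a hypothetical rotate function over plain float pairs and calls sin and cos separately, which is exactly the pair of calls a combined sincos would replace.

```ocaml
(* Rotate the point (x, y) counterclockwise around the origin by
   [angle] radians. Both the sine and the cosine of the same angle
   are needed. *)
let rotate ~angle (x, y) =
  let s = sin angle and c = cos angle in
  ((x *. c) -. (y *. s), (x *. s) +. (y *. c))
```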