Pure-OCaml Statistical Library: Feasible?

Hello all,

This is my first time really interacting with the OCaml community!

I’m interested in building a statistical library that’s OCaml all the way down, rather than one that relies on C bindings, as Owl and the other libraries mentioned in this discussion thread do. I’m fairly familiar with statistics but not so much with OCaml, so my question is: are there significant issues I might face when implementing a statistical library without C bindings? Or is there simply no good reason to do it, perhaps for a reason I’m missing?

Looking forward to hearing from the community!

5 Likes

It should be fine, and native arrays are rather performant (see e.g. the dumb benchmark here). All of owl-base is, in fact, implemented in pure OCaml: feel free to take a peek at the code!

EDIT: I had incorrectly stated that Owl’s stats were implemented in OCaml, but this is true only for the stats module in owl-base.
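
To give a feel for what "pure OCaml on native arrays" looks like in practice, here is a minimal sketch of a single-pass mean and sample variance over a plain `float array`, using Welford's online update. The function name `mean_variance` is made up for illustration and is not taken from owl-base; it only uses the standard library.

```ocaml
(* Single-pass mean and unbiased sample variance over a float array,
   using Welford's online update. Standard library only.
   [mean_variance] is a hypothetical name, not an Owl function. *)
let mean_variance (xs : float array) : float * float =
  let n = Array.length xs in
  if n = 0 then invalid_arg "mean_variance: empty array";
  let mean = ref 0.0 and m2 = ref 0.0 in
  for i = 0 to n - 1 do
    let x = xs.(i) in
    let delta = x -. !mean in
    (* Update the running mean, then the running sum of squared deviations. *)
    mean := !mean +. delta /. float_of_int (i + 1);
    m2 := !m2 +. delta *. (x -. !mean)
  done;
  (* With a single observation the sample variance is undefined; return 0. *)
  let variance = if n > 1 then !m2 /. float_of_int (n - 1) else 0.0 in
  (!mean, variance)
```

Calling `mean_variance [| 2.; 4.; 4.; 4.; 5.; 5.; 7.; 9. |]` should give a mean of 5 and a sample variance of 32/7, up to floating-point rounding.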

6 Likes

Excellent, thank you! I saw that you’re a member of their team. Do you know whether there’s any interest in expanding Owl’s pure OCaml implementation? I don’t want to reinvent the wheel if I can contribute to a well-known library instead.

2 Likes

If it is done organically with the main owl library, sure!

For many of the extensions we ended up creating external packages (like owl-symbolic, owl-ode, …) and contributing to the main project whatever we needed from the base library but found missing, broken, or improvable.

I think the best way would be to do something similar, so you can experiment and are free of any ties — and long design discussions — but can still benefit from the main implementation and tooling.

1 Like

I’m curious about the motivation for replacing C bindings to well-established, optimized external libs with pure OCaml.

Is it just to make the statistical functions available in js_of_ocaml or other alternative platforms? Or maybe to have more control over running code on GPUs? Or maybe the functions in outside libraries aren’t flexible enough?

1 Like

A combination of a few things. I tend to do a lot of work with statistical computing and as I’m getting more into OCaml, it would be nice to be able to use it for more things. Also, I’ve found that the more flexibility I can have in libraries, the more I can do with them and the more I can learn. It would be a lot more difficult for me to rapidly change parts of the C library than it would for an OCaml implementation.

This isn’t necessarily meant to be an alternative to owl-stats, at least not initially, because most people don’t want/need the flexibility that a pure-OCaml implementation would bring (not to mention better optimization with Owl). As I learn and add to the library I’m hoping it will attract other contributors and grow from there, but for now it’s more of a way to learn in a practical application.

4 Likes

That all seems very worthwhile, @jordanmerrick. And it may have a side effect of making more functionality available for OCaml compiled to targets other than standard OS binaries.

2 Likes

If you start that effort, please don’t make the same mistake as owl: a myriad of dependencies, some of them pretty hard to install.
Yes, a pure OCaml statistical library would be useful.
For inspiration, there are several maths/stats libraries in OCaml (not pure OCaml):
gsl, pareto, oml, and owl, for example.

2 Likes

Yes, I’m trying to keep it as close to the stdlib as possible! I am considering using Jane Street’s Base or Core libraries, because they offer some very useful data structures that may help me make the library as efficient as possible, and they’re widely enough used that I think the extra dependency is alright. Otherwise, I’d much rather develop things in-house than rely on dependencies.

I’ll be uploading my work to GitHub with some very basic functions once my university finals are done!

The key thing to realize is that any pure OCaml library will be relatively slow, since OCaml tends to box floats and cannot make use of SIMD instructions by itself. This means that any application that doesn’t have to use pure OCaml (say, because of jsoo) will prefer a faster solution like Owl, which uses OpenBLAS for speed. Also, if I understand correctly, Owl itself is working on a pure OCaml backend alternative for things like jsoo.
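
As a rough illustration of the kind of inner loop this is about, here is a sketch of a dot product over two `float array` values. It assumes the native compiler (ocamlopt) with its default flat float arrays, where a `float array` holds raw doubles and a local accumulator that never escapes the loop is typically kept unboxed. The loop is still scalar: nothing here uses SIMD, which is where a BLAS-backed implementation would be expected to pull ahead. The name `dot` is hypothetical.

```ocaml
(* Scalar dot product over float arrays, standard library only.
   [dot] is a hypothetical name used for illustration. *)
let dot (xs : float array) (ys : float array) : float =
  let n = Array.length xs in
  if Array.length ys <> n then invalid_arg "dot: length mismatch";
  (* Local float accumulator; it does not escape the function. *)
  let acc = ref 0.0 in
  for i = 0 to n - 1 do
    acc := !acc +. xs.(i) *. ys.(i)
  done;
  !acc
```

Boxing tends to reappear when floats cross polymorphic boundaries, for example when the same loop is written with a generic fold over a `'a list`, so the shape of the code matters for this kind of library.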

4 Likes

Sometimes, I prefer portability over efficiency.
Statistical libraries don’t need to be so high performance.
They just need to be correct and offer some functionalities.

3 Likes

To keep it as close as possible to the stdlib, also have a look at Containers. It extends the stdlib in a light way instead of replacing it with a different one.

In any case, if you want to stick with pure OCaml, have a look at the facilities for managing ndarrays in owl-base: they are pure OCaml and depend only on ocaml and dune. I think owl-base would benefit from more users and feedback, and at the same time it would provide you with tested, essential building blocks that you would otherwise have to reimplement.

6 Likes

To move towards a better scientific OCaml ecosystem, I think it’s important for libraries to be as interoperable as possible. I’m hoping owl or, maybe better, owl-base can serve as a common ground to define data types that all relevant libraries then use, which would make them interoperable. Julia is successfully using this kind of approach, with many small, independently developed libraries that agree on foundational interfaces, afaik. For statistics, in addition to matrices, maybe one would want a Distribution.t, a Histogram.t and a data table. The latter seems to be particularly hairy in OCaml if one wants type safety, performance and flexibility. Also, +1 for Containers.
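
One way such a shared foundation could look, sketched only for the distribution case: a small module signature that independent libraries agree on, plus code written against the signature rather than against any one implementation. Everything here (`DISTRIBUTION`, `Uniform`, `interval_probability`) is a hypothetical name invented for this sketch, not something that exists in Owl or owl-base.

```ocaml
(* Hypothetical shared interface; none of these names come from Owl. *)
module type DISTRIBUTION = sig
  type t
  val pdf : t -> float -> float
  val cdf : t -> float -> float
  val mean : t -> float
end

(* One possible implementation: the continuous uniform distribution. *)
module Uniform : sig
  include DISTRIBUTION
  val make : lo:float -> hi:float -> t
end = struct
  type t = { lo : float; hi : float }

  let make ~lo ~hi =
    if not (lo < hi) then invalid_arg "Uniform.make: need lo < hi";
    { lo; hi }

  let pdf d x = if x < d.lo || x > d.hi then 0.0 else 1.0 /. (d.hi -. d.lo)

  let cdf d x =
    if x <= d.lo then 0.0
    else if x >= d.hi then 1.0
    else (x -. d.lo) /. (d.hi -. d.lo)

  let mean d = 0.5 *. (d.lo +. d.hi)
end

(* Written against the signature only, so it works with any library
   whose distribution module satisfies DISTRIBUTION. *)
let interval_probability (type d) (module D : DISTRIBUTION with type t = d)
    (dist : d) ~a ~b =
  D.cdf dist b -. D.cdf dist a
```

For example, `interval_probability (module Uniform) (Uniform.make ~lo:0.0 ~hi:2.0) ~a:0.5 ~b:1.5` evaluates the probability of the interval [0.5, 1.5] under a uniform distribution on [0, 2]. A histogram or data table type would need its own, likely more involved, signature.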

5 Likes

I just checked Owl_stats and found that, e.g., mean and histogram are implemented for float arrays. I found that a bit disappointing. Why not stick with Bigarray and provide generic statistics functions for all supported Bigarray kinds? Actually, the types in Owl_distribution_generic seem to be like that, so maybe the Owl_stats implementation is just preliminary as it stands.
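
As a sketch of what "generic over Bigarray kinds" could mean for the simplest case, the function below computes a mean over any one-dimensional, C-layout Bigarray whose OCaml-side element type is `float`, which covers both the `float32` and `float64` kinds. The name `mean` and its shape are assumptions made for illustration; this is not the signature used by Owl_stats.

```ocaml
(* Mean over a 1-D, C-layout Bigarray of floats. The kind parameter 'k is
   left polymorphic, so float32 and float64 arrays are both accepted.
   [mean] is a hypothetical function, not Owl's. *)
let mean (a : (float, 'k, Bigarray.c_layout) Bigarray.Array1.t) : float =
  let n = Bigarray.Array1.dim a in
  if n = 0 then invalid_arg "mean: empty Bigarray";
  let sum = ref 0.0 in
  for i = 0 to n - 1 do
    sum := !sum +. a.{i}
  done;
  !sum /. float_of_int n
```

Integer and complex kinds have a different OCaml-side element type, so covering them would need a different approach, for instance dispatching on the kind value. Note also that element access is generally faster when the kind is known statically, so a fully generic function may cost some performance compared with one specialized to `float64`.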

1 Like

Yes, your observation is correct. There is also an open issue about that: https://github.com/owlbarn/owl/issues/461. Not many people have used that module, so it has not yet received enough attention. This is why I think it could be valuable to use it as a base and update or fix it as appropriate.

Concerning interoperability, I agree. One thing I like is that all the numerical and plotting libraries in OCaml, even the ones that shell out to other languages, are built around bigarrays, so you can really mix and match as needed. This is what we did in owl-ode when we added support for odepack and sundials, for example, and it is what I rely on when I want to plot with ocaml-matplotlib or gnuplot instead of plplot. There are more examples out there.

I think finding a common base, and then creating libraries or alternative implementations on top of it, could allow the ecosystem to grow organically without splitting the effort too much. Otherwise we keep reinventing the same primitives over and over, and we are left with a fragmented ecosystem.
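
For a library that works on plain `float array` values internally, joining this bigarray-based ecosystem mostly comes down to converting at the boundary. A minimal sketch using only the standard library's Bigarray module (the helper names `to_bigarray` and `of_bigarray` are made up for this example):

```ocaml
(* Copy a float array into a 1-D float64 Bigarray with C layout. *)
let to_bigarray (xs : float array) =
  Bigarray.Array1.of_array Bigarray.float64 Bigarray.c_layout xs

(* Copy a 1-D float64 Bigarray with C layout back into a float array. *)
let of_bigarray
    (a : (float, Bigarray.float64_elt, Bigarray.c_layout) Bigarray.Array1.t)
    : float array =
  Array.init (Bigarray.Array1.dim a) (fun i -> a.{i})
```

Both directions copy the data, so for large inputs it may be preferable to work on the Bigarray directly.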

8 Likes

@UnixJunkie I think this is one of the driving forces behind replacing C code with OCaml in owl. Although, as @mseri said above, it is being done organically and is not a high priority.