Pure-OCaml Statistical Library: Feasible?

jordanmerrick · November 27, 2020, 7:27pm

Hello all,

This is my first time really interacting with the OCaml community!

I’m interested in building a statistical library that’s OCaml all the way down, rather than using C bindings like Owl, or other libraries as seen on this discussion thread. I’m rather familiar with statistics but not so much with OCaml, so my question is: are there significant issues I may face when implementing a statistical library without C bindings? Or is there a reason not to do this that I’m missing?

Looking forward to hearing from the community!

mseri · November 27, 2020, 8:00pm

It should be fine, and native arrays are rather performant (see e.g. the dumb benchmark here). All of owl-base is, in fact, implemented in pure OCaml: feel free to take a peek at the code!

EDIT: I had incorrectly stated that owl stats was implemented in ocaml, but this is true only for the stats module in owl-base
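
To illustrate the kind of code a pure-OCaml statistics module built on native float arrays might contain, here is a minimal sketch of a single-pass mean and sample variance using Welford's algorithm. The function name `mean_variance` is hypothetical and is not taken from owl-base; it only uses the standard library.

```ocaml
(* Single-pass mean and sample variance (Welford's algorithm).
   Hypothetical helper, not part of Owl or owl-base. *)
let mean_variance (xs : float array) : float * float =
  let n = Array.length xs in
  if n < 2 then invalid_arg "mean_variance: need at least two samples";
  let mean = ref 0.0 and m2 = ref 0.0 in
  Array.iteri
    (fun i x ->
      (* Update the running mean, then the running sum of squared deviations. *)
      let delta = x -. !mean in
      mean := !mean +. (delta /. float_of_int (i + 1));
      m2 := !m2 +. (delta *. (x -. !mean)))
    xs;
  (* Divide by n - 1 for the unbiased sample variance. *)
  (!mean, !m2 /. float_of_int (n - 1))
```

For example, `mean_variance [| 2.; 4.; 4.; 4.; 5.; 5.; 7.; 9. |]` should give a mean close to 5 and a sample variance close to 32/7, up to floating-point rounding.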

jordanmerrick · November 27, 2020, 8:26pm

Excellent, thank you! I saw that you’re a member of their team, are you aware if there’s any interest in expanding Owl’s pure OCaml implementation? I don’t want to reinvent the wheel if I can contribute to a well-known library instead.

mseri · November 27, 2020, 10:14pm

If it is done organically with the main owl library, sure!

For many of the extensions we ended up creating external packages (like owl-symbolic, owl-ode, …) and contributed to the main project whatever we needed from the base library but found missing/broken/improvable.

I think the best way would be to do something similar, so you can experiment and are free of any ties — and long design discussions — but can still benefit from the main implementation and tooling.

mars0i · November 27, 2020, 10:47pm

I’m curious about the motivation for replacing C bindings to well-established, optimized external libs with pure OCaml.

Is it just to make the statistical functions available in js_of_ocaml or other alternative platforms? Or maybe to have more control over running code on GPUs? Or maybe the functions in outside libraries aren’t flexible enough?

jordanmerrick · November 27, 2020, 11:12pm

A combination of a few things. I tend to do a lot of work with statistical computing and as I’m getting more into OCaml, it would be nice to be able to use it for more things. Also, I’ve found that the more flexibility I can have in libraries, the more I can do with them and the more I can learn. It would be a lot more difficult for me to rapidly change parts of the C library than it would for an OCaml implementation.

This isn’t necessarily meant to be an alternative to owl-stats, at least not initially, because most people don’t want/need the flexibility that a pure-OCaml implementation would bring (not to mention better optimization with Owl). As I learn and add to the library I’m hoping it will attract other contributors and grow from there, but for now it’s more of a way to learn in a practical application.

mars0i · November 28, 2020, 5:49pm

That all seems very worthwhile, @jordanmerrick. And it may have a side effect of making more functionality available for OCaml compiled to targets other than standard OS binaries.

UnixJunkie · December 1, 2020, 1:43am

If you start that effort, please don’t make the same mistakes as owl: a myriad of dependencies, some pretty hard to install.
Yes, a pure ocaml statistical library would be useful.
To get some inspiration, there are several maths/stats libs in OCaml (not pure OCaml):
gsl, pareto, oml, owl, for example.

jordanmerrick · December 2, 2020, 2:57am

Yes, I’m trying to keep it as close to the stdlib as possible! I am considering using Jane Street’s Base or Core libraries because they offer some very useful data structures that may allow me to make the library as efficient as possible, but they’re widely enough used that I think it’s alright. I’d much rather develop things in-house than rely on dependencies if possible.

I’ll be uploading my work to GitHub with some very basic functions once my university finals are done!

bluddy · December 2, 2020, 4:28am

The key thing to realize is that any pure OCaml library will be fairly slow relatively speaking, since OCaml tends to box floats and cannot make use of SIMD instructions by itself. This means that any application that doesn’t have to use pure OCaml (say, due to jsoo), will prefer a faster solution like Owl, which makes use of OpenBLAS for speed. Also, if I understand correctly, Owl itself is working on a pure OCaml backend alternative for things like jsoo.

UnixJunkie · December 2, 2020, 7:23am

Sometimes, I prefer portability over efficiency.
Statistical libraries don’t need to be so high performance.
They just need to be correct and offer some functionalities.

mseri · December 2, 2020, 8:06am

To keep it as close as possible to the stdlib, also have a look at Containers. It extends the stdlib in a light way instead of replacing it with a different one.

In any case, if you want to stick with pure OCaml, have a look at the facilities for managing ndarrays in owl-base, they are purely ocaml based and only depend on ocaml and dune. I think owl-base would benefit from more users and feedback and at the same time it would provide you with tested essential building blocks that you would have to reimplement otherwise.

n4323 · December 2, 2020, 9:34am

To move towards a better scientific ocaml ecosystem I think it’s important for libraries to be interoperable as much as possible. I’m hoping owl or maybe better, owl-base can serve as a common ground to define data types that all relevant libraries then use, which would make them interoperable. Julia is successfully using this kind of approach with many small independently developed libraries that agree on foundational interfaces afaik. For statistics, in addition to matrices, maybe one would want a Distribution.t, a Histogram.t and a data table. The latter seems to be particularly hairy in ocaml if one wants type safety, performance and flexibility. Also, +1 for containers.
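
As a rough illustration of what an agreed-upon foundational interface could look like, here is a sketch of a shared distribution signature with one toy implementation. Everything here (`DISTRIBUTION`, `Uniform`, `make`) is hypothetical and invented for the example; it is not an existing owl-base API.

```ocaml
(* Hypothetical common signature that independent libraries could agree on. *)
module type DISTRIBUTION = sig
  type t
  val pdf : t -> float -> float
  val cdf : t -> float -> float
  val mean : t -> float
  val variance : t -> float
end

(* Toy instance: continuous uniform distribution on [lo, hi]. *)
module Uniform : sig
  include DISTRIBUTION
  val make : lo:float -> hi:float -> t
end = struct
  type t = { lo : float; hi : float }

  let make ~lo ~hi =
    if not (lo < hi) then invalid_arg "Uniform.make: need lo < hi";
    { lo; hi }

  let pdf d x = if x < d.lo || x > d.hi then 0.0 else 1.0 /. (d.hi -. d.lo)

  let cdf d x =
    if x <= d.lo then 0.0
    else if x >= d.hi then 1.0
    else (x -. d.lo) /. (d.hi -. d.lo)

  let mean d = (d.lo +. d.hi) /. 2.0
  let variance d = ((d.hi -. d.lo) ** 2.0) /. 12.0
end
```

Any library that only relies on `DISTRIBUTION` could then accept `Uniform` or a distribution from another package without knowing how it is implemented.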

n4323 · December 2, 2020, 10:15am

I just checked Owl_stats and found that e.g. mean and histogram are implemented for float arrays. That’s a bit disappointing, I thought. Why not stick with Bigarray and provide generic statistics functions for all supported Bigarray kinds? Actually, the types in Owl_distribution_generic do seem to be like that, so maybe the Owl_stats implementation is just preliminary as it stands.
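
A minimal sketch of what a kind-generic mean over a one-dimensional Bigarray could look like, assuming C layout and using only the standard `Bigarray` module. The names `to_float` and `mean` are hypothetical and this is not how Owl implements it; complex and other kinds are simply rejected here.

```ocaml
(* Convert one element to float, dispatching on the array's element kind. *)
let to_float : type a b. (a, b) Bigarray.kind -> a -> float =
 fun k x ->
  match k with
  | Bigarray.Float32 -> x
  | Bigarray.Float64 -> x
  | Bigarray.Int -> float_of_int x
  | Bigarray.Int32 -> Int32.to_float x
  | Bigarray.Int64 -> Int64.to_float x
  | _ -> invalid_arg "to_float: unsupported kind"

(* Mean of a C-layout Array1 of any kind supported by to_float. *)
let mean (a : ('a, 'b, Bigarray.c_layout) Bigarray.Array1.t) : float =
  let n = Bigarray.Array1.dim a in
  if n = 0 then invalid_arg "mean: empty array";
  let k = Bigarray.Array1.kind a in
  let sum = ref 0.0 in
  for i = 0 to n - 1 do
    sum := !sum +. to_float k (Bigarray.Array1.get a i)
  done;
  !sum /. float_of_int n
```

The same `mean` then works on, say, a `float64` array and an `int` array. A per-element dispatch like this is simple but not fast, so a real implementation would probably specialise the loop per kind.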

mseri · December 2, 2020, 11:07am

Yes, your observation is correct. There is also an open issue about that: https://github.com/owlbarn/owl/issues/461. Not many people have used that module, so it has not yet received enough attention. This is why I think it could be important to use it as a base and update or fix it as appropriate.

Concerning the interoperability, I agree. One thing that I liked is that all numerical/plotting libraries in OCaml, even the ones shelling out to other languages, are based around bigarrays, so you can really mix and match as you need: this is what we did in owl-ode when we added support for odepack and sundials, for example, or what I use when I want to plot using ocaml-matplotlib or gnuplot, instead of using plplot. But there are more examples out there.

I think finding a common base and then creating libraries or alternative implementations on top of it could allow the ecosystem to grow organically without splitting the effort too much. Otherwise we keep reinventing the same primitives over and over and we remain with a split ecosystem.

kkirstein · December 3, 2020, 12:02pm

@UnixJunkie I think this is one of the driving forces to replace C code with OCaml in owl. Although, as @mseri said above, it is done organically with no high priority.
