OCaml and OCaml libraries for data-oriented applications and data analysis

yarnton · December 3, 2020, 11:13pm

I have used OCaml sporadically in some hobby projects for more than a decade, and it’s a language I really like. The domain has been mostly compilers and static analyzers.

Professionally, I have been in a statistics role for many years. While Julia has improved the language and library ecosystem for data-oriented applications quite dramatically, I really miss a good statically typed language. Without static typing, I find it hard to scale from a prototype to a medium-sized application developed by a small team.

Why isn’t OCaml more popular in this area? What are some good libraries?

There’s Owl, a whole NumPy-SciPy-Pandas equivalent, which is quite impressive given the small number of contributors: https://ocaml.xyz. But I haven’t seen many other dedicated libraries. However, I keep hearing that OCaml is popular within insurance and finance.

For example, Jane Street is a famous OCaml user. What’s their stack like? Mostly custom code, aside from their alternative base libraries?

yawaramin · December 4, 2020, 12:41am

You may find this podcast interesting: https://signalsandthreads.com/python-ocaml-and-machine-learning/

cjr · December 4, 2020, 3:27am

Also see OCaml for Scientists. It’s older, but don’t let that put you off. If I recall correctly, there are a few parts that are out of date and/or no longer relevant, but OCaml is so stable that most of it just works. (There’s a PDF of the first chapter available on the site. It may be out of print; I got a copy through my university.)

bluddy · December 4, 2020, 6:08pm

I refer you to the relevant OCamlverse page, where Owl is of course the main star.

There are multiple reasons for OCaml not being so successful here:

Most type-based advantages are lost, since most types in scientific computing of any kind tend to be numeric arrays of various sizes and dimensions, and most static type systems don’t check array dimensions very well. This means you’re not getting much safety in return for using a typed language in this domain, other than for the code outside the numeric manipulation (which is still a big deal, just not as big of a deal).
OCaml still incurs the costs of static vs dynamic typing, but also doesn’t give that much speed benefit in this domain since its floats are often boxed and SIMD instructions aren’t used. This is why Owl uses BLAS to accelerate code (just like NumPy).
Immutable data types aren’t particularly well suited to the needs of scientific computing, where you often prefer to mutate large amounts of data in place. The one exception to this is GPU-based computation using CUDA, PyTorch, TensorFlow, etc., which use computation graphs of various kinds, and Owl has active work in this direction AFAIK.
Finally, there’s inertia. Ecosystem growth leads to more ecosystem growth in a feedback loop. The more libraries exist in OCaml for this niche, the more it’ll grow, and vice versa. Other languages have a large head start, making it a difficult task to offer a competitive library.
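
To make the in-place mutation point concrete, here is a minimal sketch using only the standard library. The function names are made up for illustration and are not taken from Owl or any other library.

```ocaml
(* Functional style: allocates a fresh array on every call,
   which adds up quickly when the arrays are large. *)
let scale_copy k xs = Array.map (fun x -> k *. x) xs

(* Imperative style: overwrites the input array and allocates
   no new array. *)
let scale_in_place k xs =
  for i = 0 to Array.length xs - 1 do
    xs.(i) <- k *. xs.(i)
  done
```

Both are ordinary OCaml, so the language does not prevent the second style. The point is only that numerical code tends to end up looking like the second function, where immutability buys you little.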

pveber · December 4, 2020, 10:34pm

Typing scalars, vectors and matrices as is done in Lacaml is already very helpful when writing numerical code. It does help catch errors. Another simple and useful possibility would be to phantom-type square matrices.
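
A rough sketch of what phantom-typing square matrices could look like. This is not Lacaml's or Owl's API; the module `Mat` and all of its functions are hypothetical, and matrices are plain `float array array` values to keep the example self-contained.

```ocaml
(* Hypothetical module: the names below are illustrative only. *)
module Mat : sig
  type square            (* phantom tag: checked to be n x n *)
  type general           (* phantom tag: no shape guarantee *)
  type 'shape t

  val of_arrays : float array array -> general t
  val square_of_arrays : float array array -> square t option
  val rows : 'shape t -> int
  val trace : square t -> float   (* only accepts square matrices *)
end = struct
  type square
  type general
  type 'shape t = float array array

  let of_arrays a = a

  (* The shape is checked once, at construction time. *)
  let square_of_arrays a =
    let n = Array.length a in
    if Array.for_all (fun row -> Array.length row = n) a
    then Some a
    else None

  let rows m = Array.length m

  (* Safe to index the diagonal: the tag says the matrix is n x n. *)
  let trace m =
    let acc = ref 0.0 in
    Array.iteri (fun i row -> acc := !acc +. row.(i)) m;
    !acc
end
```

With this signature, `Mat.trace (Mat.of_arrays a)` is rejected at compile time, because a `general t` is not a `square t`. The runtime check happens once in `square_of_arrays`.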

yarnton · December 5, 2020, 4:19pm

My original point was that even if simple static typing does not offer clear advantages on numerical code, there’s a big advantage on the rest of the code.

Typically, most code around numerics will involve tons of data preprocessing and business logic. Here, Hindley-Milner-like polymorphic (static) typing is really advantageous.
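
As a small illustration of the preprocessing side, here is a sketch with a made-up record layout. The field names and cleaning rules are assumptions for the example, not anything from a real dataset or library.

```ocaml
(* Hypothetical raw row, as read from a CSV file: everything is a string. *)
type raw = { raw_id : int; raw_age : string; raw_income : string }

(* Missing values are explicit in the types instead of being NaN or nil. *)
type age = Known of int | Missing

type clean = { id : int; age : age; income : float option }

let parse_age s =
  match int_of_string_opt (String.trim s) with
  | Some n when n >= 0 -> Known n
  | Some _ | None -> Missing

let clean_row r =
  { id = r.raw_id;
    age = parse_age r.raw_age;
    income = float_of_string_opt (String.trim r.raw_income) }

(* The compiler forces the caller to handle the "no usable rows" case. *)
let mean_income rows =
  match List.filter_map (fun r -> r.income) rows with
  | [] -> None
  | xs -> Some (List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs))
```

All the types here are inferred, and forgetting the `Missing` or `[]` case produces a compiler warning rather than a surprise in production.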

Furthermore, small extensions to the type system, like units of measure (as in F#) or checked array shapes, would also make numerics quite advantageous to write in OCaml.

bluddy · December 7, 2020, 10:16pm

Typing scalars, vectors and matrices is available in Owl as well. It’s a minor benefit IMO, and certainly not sufficient to switch from dynamic languages.

Checking array shapes isn’t so simple. Think of the functions that transform between shapes (squeeze, unsqueeze, index etc). Not only do they somehow need to work on all shape sizes (you can probably use macros for this), their behavior also often changes based on dynamic data (e.g. squeeze removes single-dimension entries in the shape). So you quickly get into dependent type territory.
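
A toy version of the squeeze problem, with shapes modelled as plain `int list` values. This is a simplification for illustration and not how Owl represents shapes.

```ocaml
(* A shape such as [1; 3; 1; 4] stands for a 1x3x1x4 array. *)
let squeeze shape = List.filter (fun d -> d <> 1) shape

let rank shape = List.length shape

(* Two inputs of the same rank ... *)
let a = squeeze [1; 3; 1; 4]   (* [3; 4] *)
let b = squeeze [2; 3; 5; 4]   (* [2; 3; 5; 4] *)

(* ... give outputs of different ranks, depending on the values. *)
let ranks = (rank a, rank b)   (* (2, 4) *)
```

Both inputs have rank 4, yet the outputs have ranks 2 and 4. A static type for `squeeze` would therefore have to mention the runtime dimension values and not only the rank, which is where dependent types come in.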
