OCaml and OCaml libraries for data-oriented applications and data analysis

I have used OCaml sporadically in some hobby projects for more than a decade, and it’s a language I really like. The domain has been mostly compilers and static analyzers.

Professionally, I have been in a statistics role for many years. While Julia has dramatically improved the language and library ecosystem for data-oriented applications, I really miss a good statically typed language. Without one, I find it hard to scale from a prototype to a medium-sized application developed by a small team without running into issues.

Why isn’t OCaml more popular in this area? What are some good libraries?

There’s Owl, a whole NumPy-SciPy-Pandas equivalent, which is quite impressive given the small number of contributors: https://ocaml.xyz. But I haven’t seen many other dedicated libraries. However, I keep hearing that OCaml is popular within insurance and finance.

For example, Jane Street is a famous OCaml user. What’s their stack like? Mostly custom code, aside from their alternative base libraries?


You may find this podcast interesting: https://signalsandthreads.com/python-ocaml-and-machine-learning/


Also see OCaml for Scientists. It’s older, but don’t let that put you off. If I recall correctly, there are a few parts that are out of date and/or no longer relevant, but OCaml is so stable that most of it just works. (There’s a PDF of the first chapter available on the site. It may be out of print; I got a copy through my university.)


I refer you to the relevant OCamlverse page, where Owl is of course the main star.

There are multiple reasons for OCaml not being so successful here:

  • Most type-based advantages are lost, since most types in scientific computing of any kind tend to be numeric arrays of various sizes and dimensions, and most static type systems don’t check array dimensions very well. This means you’re not getting much safety in return for using a typed language in this domain, other than for the code outside the numeric manipulation (which is still a big deal, just not as big of a deal).
  • OCaml still incurs the costs of static vs dynamic typing, but also doesn’t give that much speed benefit in this domain, since its floats are often boxed and SIMD instructions aren’t used. This is why Owl uses BLAS to accelerate code (just like NumPy).
  • Immutable data types aren’t particularly well suited to the needs of scientific computing, where you often prefer to mutate large amounts of data in place. The one exception to this is GPU-based computation using CUDA, PyTorch, TensorFlow, etc., which use computation graphs of various kinds, and Owl has active work in this direction AFAIK.
  • Finally, there’s inertia. Ecosystem growth leads to more ecosystem growth in a feedback loop. The more libraries exist in OCaml for this niche, the more it’ll grow, and vice versa. Other languages have a large head start, making it a difficult task to offer a competitive library.
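
To make the boxing and in-place points concrete, here is a minimal sketch using only the standard library's Bigarray module, which keeps float64 values unboxed in one contiguous buffer (as far as I know, this is also the representation Owl and Lacaml hand to BLAS). The function name scale_in_place is my own, purely for illustration, and I am assuming an OCaml version recent enough to ship Bigarray in the standard library.

```ocaml
(* Sketch: mutate a flat, unboxed float64 buffer in place. *)
open Bigarray

(* Multiply every element of [a] by [k] without allocating a new array. *)
let scale_in_place (a : (float, float64_elt, c_layout) Array1.t) (k : float) : unit =
  for i = 0 to Array1.dim a - 1 do
    a.{i} <- k *. a.{i}
  done

(* Build a small vector and scale it; [v] itself is modified. *)
let v = Array1.of_array float64 c_layout [| 1.0; 2.0; 3.0 |]
let () = scale_in_place v 2.0
```

Nothing here stops you from writing imperative numeric code in OCaml; the point of the list above is that doing so gives up much of what the language is usually praised for.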

Typing scalars, vectors and matrices as is done in Lacaml is already very helpful when writing numerical code. It does help catch errors. Another simple and useful possibility would be to use phantom types for square matrices.
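
A rough sketch of the phantom-type idea, not taken from Lacaml or any other library: the module Mat and all of its functions are hypothetical names I made up, and the representation (float array array) is chosen only to keep the example short.

```ocaml
(* Sketch: a phantom type parameter records whether a matrix is known to be square. *)
module Mat : sig
  type square
  type general
  type 'shape t
  val create : int -> int -> float -> general t
  val identity : int -> square t
  val dims : 'shape t -> int * int
  val as_square : general t -> square t option
  val trace : square t -> float
end = struct
  type square
  type general
  (* The parameter is never used in the representation: it exists only for the type checker. *)
  type 'shape t = float array array

  let create rows cols x = Array.make_matrix rows cols x

  let identity n =
    Array.init n (fun i -> Array.init n (fun j -> if i = j then 1.0 else 0.0))

  let dims m =
    let rows = Array.length m in
    (rows, if rows = 0 then 0 else Array.length m.(0))

  (* The only runtime check: promote a general matrix to a square one if the dimensions agree. *)
  let as_square m =
    let r, c = dims m in
    if r = c then Some m else None

  (* Only accepts matrices whose type says they are square. *)
  let trace m =
    let acc = ref 0.0 in
    Array.iteri (fun i row -> acc := !acc +. row.(i)) m;
    !acc
end

let id3 = Mat.identity 3
let tr = Mat.trace id3
let rect = Mat.create 2 3 0.0
(* Mat.trace rect would be rejected at compile time: general t is not square t. *)
let checked = Mat.as_square rect
```

The squareness check happens once, in as_square; after that the type carries the fact around for free.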


My original point was that even if simple static typing does not offer clear advantages on numerical code, there’s a big advantage on the rest of the code.

Typically, most code around numerics will involve tons of data preprocessing and business logic. Here, Hindley-Milner-like polymorphic (static) typing is really advantageous.
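
As a small illustration of what I mean for preprocessing code, here is a sketch with made-up names (cell, parse_cell, mean_of_numbers) and an assumed convention that an empty field or NA means a missing value. The compiler checks that every case of the variant is handled wherever it is matched.

```ocaml
(* Sketch: a raw text field is parsed once into a typed value. *)
type cell =
  | Missing
  | Number of float
  | Label of string

(* Classify a raw field; treating the empty string and NA as missing is an assumption. *)
let parse_cell (raw : string) : cell =
  match String.trim raw with
  | "" | "NA" -> Missing
  | s ->
    (match float_of_string_opt s with
     | Some x -> Number x
     | None -> Label s)

(* Mean of the numeric cells only; None if there are no numeric cells. *)
let mean_of_numbers (cells : cell list) : float option =
  let sum, n =
    List.fold_left
      (fun (sum, n) c ->
         match c with
         | Number x -> (sum +. x, n + 1)
         | Missing | Label _ -> (sum, n))
      (0.0, 0) cells
  in
  if n = 0 then None else Some (sum /. float_of_int n)

let column = List.map parse_cell [ "1.5"; "NA"; " 2.5 "; "n/a" ]
let column_mean = mean_of_numbers column
```

If someone later adds a new constructor to cell, every match that forgets it gets flagged at compile time, which is exactly the kind of help I miss in a dynamic language once a team is involved.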

Furthermore, small extensions to the typing system like units (F#) or checking array shapes would also make numerics quite advantageous to write in OCaml.


This is available in Owl as well. It’s a minor benefit IMO – certainly not sufficient to switch from dynamic languages.

This isn’t so simple. Think of the functions that transform between shapes (squeeze, unsqueeze, index etc). Not only do they somehow need to work on all shape sizes (you can probably use macros for this), their behavior also often changes based on dynamic data (e.g. squeeze removes single-dimension entries in the shape). So you quickly get into dependent type territory.
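
To show the squeeze problem in a few lines, here is a toy version that works on a shape alone, represented as a plain int array. This is my own simplification, not Owl's API.

```ocaml
(* Sketch: drop every dimension of size 1 from a shape. *)
let squeeze (shape : int array) : int array =
  Array.of_list (List.filter (fun d -> d <> 1) (Array.to_list shape))

(* Both inputs have rank 4, but the output ranks differ. *)
let a = squeeze [| 1; 3; 1; 4 |]
let b = squeeze [| 2; 3; 5; 4 |]
let rank_a = Array.length a
let rank_b = Array.length b
```

Both calls take a rank-4 shape, yet one result has rank 2 and the other rank 4. The output rank depends on the values inside the shape, so a type that tracks rank would have to depend on runtime data.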
