OCaml for Data Science

Hello. Do you use OCaml for data science? If so, could you please give a short overview of the status of data science in OCaml? I mean:

  • What makes OCaml good/bad for data science (long story short of your experience with OCaml in this area)?

  • What are the OCaml alternatives for Python’s Pandas, NumPy, SciPy, etc.?

  • Do you know some frontier companies/products/projects that use OCaml for data science?

  • Are there any problems that are related to data science and were solved by other platforms, but not by OCaml as a platform?

  • Maybe you can give me a good piece of advice related to both OCaml and data science.

Thank you for your response.

3 Likes

I use OCaml for machine learning these days.
But I’m just a user in this area, not a researcher in those topics.

Here are my thoughts:

1 What makes OCaml good/bad for data science (long story short of your experience with OCaml in this area)?

OCaml is good for fast but correct prototyping of scientific software.

2 What are the OCaml alternatives for Python’s Pandas, NumPy, SciPy, etc.?

Maybe Owl as a NumPy alternative.
But the OCaml ecosystem is for sure less developed than the Python one in this area.

3 Do you know some frontier companies/products/projects that use OCaml for data science?

People at the Hammer Lab (Mount Sinai, in New York), if you consider bioinformatics to be data science (I guess they handle a lot of data and do science, so to me that’s data science).

4 Are there any problems that are related to data science and were solved by other platforms, but not by OCaml as a platform?

I don’t know.

5 Maybe you can give me a good piece of advice related to both OCaml and data science.

Let’s say you are in OCaml, but you really need to access the functionality f from Python module m.
You can write a small Python script that calls m.f, then run this script from the OCaml side and read its result from there too.
Recently, I did this for several R packages I wanted to use (I want to stay in OCaml; R’s language is just crazy, like some kind of Perl on steroids).
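A minimal sketch of that approach, using only the OCaml standard library: run a command through the shell, send its standard output to a temporary file, and read the lines back. The helper name `run_and_read` and the temp-file prefix are made up for illustration, and the sketch assumes a POSIX-like shell.

```ocaml
(* Run a shell command, capture its standard output in a temporary file,
   and return the output as a list of lines.
   [run_and_read] is a hypothetical helper, not part of any library. *)
let run_and_read (cmd : string) : string list =
  let tmp = Filename.temp_file "ocaml_call" ".out" in
  (* Redirect stdout of the command into the temporary file. *)
  let status =
    Sys.command (Printf.sprintf "%s > %s" cmd (Filename.quote tmp))
  in
  let ic = open_in tmp in
  let rec read acc =
    match input_line ic with
    | line -> read (line :: acc)
    | exception End_of_file -> List.rev acc
  in
  let lines = read [] in
  close_in ic;
  Sys.remove tmp;
  if status <> 0 then
    failwith (Printf.sprintf "command failed with status %d" status);
  lines
```

For instance, `run_and_read "python3 -c 'import math; print(math.sqrt(2.0))'"` would return whatever the Python side printed, assuming `python3` is on the PATH; the same helper should work for an `Rscript` call. Anything structured (numbers, tables) still has to be parsed from the returned strings.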

2 Likes

I don’t use OCaml for data science, but recently I have been doing a lot of data analysis with Python and MATLAB. So I am not really qualified to answer your question, but let me share my thoughts on it anyway.

  • (Bad) Lack of libraries: I have never heard of equivalents of SciPy (statistics), scikit-learn (machine learning), or deep-learning frameworks that are mature and comprehensive enough.
  • (Bad) Lack of popularity: Not many people know OCaml, so it is not easy to hire a programmer or find a collaborator.
  • (Bad) Lack of recognition: If I use a not-so-popular library implemented in OCaml in a publication, people may get suspicious of the quality of the tool.
  • (Good) Static type checking: Many trivial bugs can be found by type checking, without tests. Developing tests is costly, often requires realistic data, and is not worth doing for a one-shot program. If I can use OCaml, I don’t have to be disappointed by a crash after a long computation.

I don’t know. With enough libraries, OCaml should be competitive with Python, etc.

I think creating a good framework/library and publishing it at a machine learning conference/journal (not a functional programming/programming language conference/journal) would help it gain recognition.

1 Like

As @UnixJunkie suggested, I think it’s worth looking at Owl. It is still in its early stages in many ways, but in other ways it seems fairly mature, and its capabilities seem to be growing quickly. (Parts of the interface may change, though.)

4 Likes

About calling Python: Are you using one of the OCaml packages that are made to interface with Python? (lymp, pymp, whatnot). Do you have any experience to share?

I have been using OCaml for simulations and model fitting, which could be considered data science to some extent. For that I used lacaml (good but bare-metal), and for optimization I resorted to gsl-ocaml. Compared to other ecosystems these are really basic tools. I was a bit limited in what I could achieve in a small amount of time.
I will second Owl. It’s basically trying to do what you are looking for. So far it has most of the functionality of numpy, some of scipy (optimization, special functions) and it also provides automatic differentiation and through that, neural network functionality.
I’m not aware of any OCaml library that does dataframes.

2 Likes

Some previous discussion on this

1 Like

OCaml would take a huge leap forward in this domain if Owl supported CUDA. Currently it has some preliminary support for OpenCL, but the rest of the world is using CUDA, presumably for good reason. This would be a good place to focus developer attention, if one is looking to contribute to OCaml’s standing. In essence, Owl is a mix of numpy, scipy and pytorch, but without proper CUDA support, it’s not going to take off nearly as well.

2 Likes

I miss CUDA support too, because we have invested a lot in NVIDIA GPUs. But OpenCL support is also interesting, because people are starting to see NVIDIA’s monopoly as dangerous. Also, I want to tap my iMac Pro’s GPU 😀

2 Likes

There are 2 types of people who do “Data Science”: (1) data scientists who use algorithms and (2) those who create algorithms. OCaml isn’t great for the former but, in my highly biased opinion, fantastic for the latter.

On the creation aspect, I think there are three things that set OCaml apart.

  1. Thinking about your problem as a series of transformations (like a compiler) is extremely productive. In this respect, OCaml’s simple yet expressive abstraction mechanisms shine.
  2. Couple that with fast native compilation and good runtime performance, which let you iterate and diagnose quickly, and you can really explore the problem domain.
  3. Finally, the deep-thinking yet pragmatic language developers continuously impress me. I am sometimes just dumbfounded that with every new version of OCaml (or after adding flambda) my not-so-good code continues to speed up by 2-3%.
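To illustrate the first point, here is a small sketch of a data-cleaning task written as a chain of transformations with the `|>` operator. The input format ("name,value" lines) and the function names are assumptions chosen for the example, not something from a particular project.

```ocaml
(* Stage 1: parse one "name,value" line; malformed lines become None. *)
let parse_line (line : string) : (string * float) option =
  match String.split_on_char ',' line with
  | [ name; value ] ->
    Option.map
      (fun v -> (String.trim name, v))
      (float_of_string_opt (String.trim value))
  | _ -> None

(* The whole job reads as a series of passes, like compiler stages:
   parse, drop invalid rows, drop negative values, then aggregate. *)
let total_of_valid_rows (lines : string list) : float =
  lines
  |> List.filter_map parse_line
  |> List.filter (fun (_, v) -> v >= 0.0)
  |> List.fold_left (fun acc (_, v) -> acc +. v) 0.0
```

Each stage has its own type, so inserting, removing, or reordering a pass is checked by the compiler rather than discovered at run time.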

One negative is that there are some execution models that are not well supported by OCaml (SIMD, GPU), so your mileage might be limited.

From the usage perspective, yes, the libraries are limited, and chances are high that the one you need won’t be available. Having said that, it isn’t as if the application of ‘data science’ has converged on any one language or technique; within this year I’ve used Python, R, Octave (no MATLAB, I should have just rewritten the code in OCaml) and command-line tools (VW). The practical use of these tools requires understanding the methods and their inherent assumptions (e.g. normalizing data for PCA, or the assumptions of some hypothesis test), which are difficult to remember and annoying to diagnose when violated. I wish for a good typed representation of data and methods that would prevent me from making these mistakes.

9 Likes

Even AMD is backing off of OpenCL now as far as I can tell, so the future isn’t bright. CUDA is pretty essential for any serious work nowadays.

1 Like

I never used one of the packages to interface with Python.
Maybe I will someday if something I want to use is only available from the Python ecosystem.
Recently, I prefer to tap into the R ecosystem: it is very well established and has a lot of quality contributions.

The only reason I see for CUDA being more widely used than OpenCL is the market share of NVIDIA’s graphics cards. I don’t like monopolies; they are bad for customers in the long term.

1 Like

Hi. Sorry for the late reply. I’m using OCaml for data science at work.

What makes OCaml good/bad for data science (long story short of your experience with OCaml in this area)?

  • (Good) OCaml is fast.
  • (Good) Static typing prevents many small bugs. For example, Python often gives me errors like a missing key in a dict after a long computation, whereas OCaml catches them at compile time (when we use records).
  • (Bad) There are fewer OCaml libraries for machine learning than Python ones.
  • (Bad) OCaml does not support multicore.
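The point about records can be sketched as follows. The `sample` type and its fields are invented for the example; the contrast is between a record, whose field names are checked by the compiler, and a string-keyed table, which behaves like a Python dict.

```ocaml
(* A record: field names are part of the type. *)
type sample = { id : int; weight : float }

let total_weight (samples : sample list) : float =
  List.fold_left (fun acc s -> acc +. s.weight) 0.0 samples

(* Writing [s.wieght] above would be rejected at compile time
   with an "unbound record field" error, before anything runs. *)

(* The dict-like alternative: a misspelled key is only noticed at run time.
   Using [find_opt] at least forces the caller to handle the missing case. *)
let dict_weight (tbl : (string, float) Hashtbl.t) : float option =
  Hashtbl.find_opt tbl "weight"
```

With the record version, a typo in a field name never survives to the end of a long computation.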

What are the OCaml alternatives for Python’s Pandas, NumPy, SciPy, etc.?

As some people mentioned, Owl is similar to NumPy.

Do you know some frontier companies/products/projects that use OCaml for data science?

I don’t know. I use OCaml for data science personally. However, my colleagues use their favorite languages, e.g., Java, Python, etc.

Are there any problems that are related to data science and were solved by other platforms, but not by OCaml as a platform?

Lack of libraries, multi-core support and scalable distributed-memory processing environments (I know some opam packages such as rpc_parallel, but I cannot find enough examples).

Maybe you can give me a good piece of advice related to both OCaml and data science.

Jupyter (http://jupyter.org/) is very useful and it can execute OCaml code: https://akabe.github.io/ocaml-jupyter/. A Docker image containing many packages for data science is available: https://github.com/akabe/docker-ocaml-jupyter-datascience, and some examples are at https://github.com/akabe/docker-ocaml-jupyter-datascience/tree/master/notebooks.
Please try them, if you are interested.

6 Likes

Sorry, but unless you are on Windows, look for the following libraries in opam:
parmap, parany, ocamlnet, etc.
I run parallel OCaml programs every day.
Parmap is a good start.
There are even more such libraries that I did not mention.

4 Likes

Some people (like this person https://github.com/examachine) are using the OCaml MPI bindings in production.
If you really want to create scalable distributed applications, I would advise using the ZeroMQ OCaml bindings available in opam (package zmq); they would force you to program in an agent-based style, à la Erlang.
I have done it once in the past, and it was pretty fun to write:
https://github.com/UnixJunkie/daft

3 Likes

I have used OCaml for machine learning in the past and the experience was very good but lack of multicore meant that no further work was pursued using OCaml. And yes, we used parmap, etc… but it just didn’t cut it for us, sorry.

It’s also worth noting that there is significantly higher demand for data scientists, i.e. people who know how to apply existing algorithms to real-world problems, than for developers who know how to implement such algorithms.

Data scientists are typically not so invested in programming languages and hence go for the easiest languages to learn. That explains a lot about why Python and R became so popular in that field despite being arguably technically inferior on many language dimensions.

I always find it amusing when people complain about the lack of multicore support as an explanation for why OCaml is not seeing the adoption it would otherwise deserve, especially in data science. Python and R also lack multicore support on a language level and are at least 2 orders of magnitude more popular in the field. But whenever I cross-validate OCaml numerical code with existing Python code, I typically find that pure Python solutions run 2 orders of magnitude more slowly. Even encountering in excess of 3 orders is no shocker to me. Try to beat that by just throwing more cores at the problem!

Python data science frameworks merely offload computationally demanding tasks to external libraries written in something fast and parallel. And that actually works just fine for most practical problems. This can obviously be done with OCaml, too, but that’s not enough.

The issue with OCaml is that its greatest strength is also its greatest weakness: a powerful static type system. Given a reasonable set of libraries, many if not most practical data science problems can be solved within a few hundred lines of code. This is usually too small to really notice the benefits of static typing. But a data scientist trying to learn OCaml would quickly get annoyed by a “nitpicking” type checker. They feel more productive (even if they aren’t) when being able to just run and test stuff.

If I had to give honest advice to someone who wants to build a career in the field, I’d absolutely not recommend OCaml for data science, and this has nothing to do with the language. The library / framework ecosystem is currently not at the same level though I consider it adequate for solving most problems. The bigger barriers are non-technical (psychological / social / business) and will make it rather unlikely that OCaml will see sufficient adoption in that field in the foreseeable future.

But if you can’t resist, your best bet would probably be niche applications that play to OCaml’s strengths. E.g. dealing with highly structured, nested, non-uniform data or model specifications is typically way more difficult in mainstream languages. Not everything in life is adequately captured by tensors and feedforward neural networks. In fact, I believe machine learning research as a field somewhat neglects highly structured problems for that very reason. If you find a killer application for OCaml in that area, you will likely not see much competition for a long time. But that would be a high risk endeavor.
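As a rough illustration of what "highly structured, nested, non-uniform" can look like in OCaml, here is a sketch of a small model specification as a variant type, with an evaluator written by pattern matching. The `model` type and its constructors are invented for this example.

```ocaml
(* A nested, non-uniform model specification: each case carries
   different data, and models can contain other models. *)
type model =
  | Constant of float
  | Linear of { slope : float; intercept : float }
  | Sum of model list
  | Piecewise of { threshold : float; below : model; above : model }

(* Evaluate a model at point x. The compiler warns if a case is forgotten. *)
let rec eval (m : model) (x : float) : float =
  match m with
  | Constant c -> c
  | Linear { slope; intercept } -> (slope *. x) +. intercept
  | Sum parts -> List.fold_left (fun acc p -> acc +. eval p x) 0.0 parts
  | Piecewise { threshold; below; above } ->
    if x < threshold then eval below x else eval above x

(* An example specification mixing all of the cases. *)
let example : model =
  Piecewise
    { threshold = 0.0;
      below = Constant 1.0;
      above = Sum [ Linear { slope = 2.0; intercept = 1.0 }; Constant 0.5 ] }
```

Adding a new kind of model later means adding a constructor, and the exhaustiveness check then points at every function that needs to handle it.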

18 Likes

This is 100% right. Types show their strength when you have a large body of code and you want to be resilient to changes and refactoring. Python seems like it ‘just runs’, even though it may crash at some point because nothing is checking your code.

Python also has the advantage of having very little added syntax. Programmers don’t like indent-based syntax, but for people coming from another discipline, not having let...in all over the place is a big benefit in comprehension.

I do, however, think that OCaml and good typing can benefit data science. It would be completely amazing if we could typecheck tensor dimensions properly. Even without that, the benefits of being able to iterate rapidly and safely are huge for experimentation. And once you actually want to flesh out an ML algorithm and integrate it into a real system, the fragility of Python is a killer, and you end up in the same place as the Reason guys who desperately want typing in their JavaScript apps.
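A very limited form of dimension checking is already possible with phantom types. The sketch below is only a toy, assuming a handful of fixed sizes known at compile time; the `Vec` module and its functions are hypothetical and say nothing about how a real tensor library does or should do it.

```ocaml
module Vec : sig
  type 'n t        (* a vector tagged with its dimension *)
  type two         (* phantom tags; they have no values *)
  type three
  val v2 : float -> float -> two t
  val v3 : float -> float -> float -> three t
  val dot : 'n t -> 'n t -> float   (* both arguments must share 'n *)
end = struct
  type 'n t = float array
  type two
  type three
  let v2 a b = [| a; b |]
  let v3 a b c = [| a; b; c |]
  let dot x y =
    let s = ref 0.0 in
    Array.iteri (fun i xi -> s := !s +. (xi *. y.(i))) x;
    !s
end

(* Accepted: both vectors have dimension [two]. *)
let ok = Vec.dot (Vec.v2 1.0 2.0) (Vec.v2 3.0 4.0)

(* Rejected by the type checker if uncommented:
   let bad = Vec.dot (Vec.v2 1.0 2.0) (Vec.v3 1.0 2.0 3.0) *)
```

This does not scale to dimensions computed at run time, which is the hard part of checking tensor shapes "properly".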

3 Likes

I can’t argue, but: Static typing is still useful for small programs, and OCaml gives you that with minimal effort, especially if you don’t bother writing .mli files.

(Skipping .mli files seems to be slightly heretical. OCaml lets you choose, though. I love not having to annotate.)
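A tiny example of what "minimal effort" means here: the two functions below carry no type annotations at all, yet the compiler infers and checks their types. The function names are just for illustration.

```ocaml
(* No annotations needed; the inferred types are shown in comments. *)

(* val mean : float list -> float *)
let mean xs =
  List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs)

(* val center : float list -> float list *)
let center xs =
  let m = mean xs in
  List.map (fun x -> x -. m) xs

(* Calling [mean [1; 2; 3]] (a list of ints) is a compile-time error,
   because ( +. ) only works on floats. *)
```

Even in a script this short, mixing up ints and floats or passing an array where a list is expected is caught before the program runs.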

2 Likes