OCaml for Data Science

Some previous discussion on this

1 Like

OCaml would take a huge leap forward in this domain if Owl supported CUDA. Currently it has some preliminary support for OpenCL, but the rest of the world is using CUDA, presumably for good reason. This would be a good place to focus developer attention, if one is looking to contribute to OCaml’s standing. In essence, Owl is a mix of numpy, scipy and pytorch, but without proper CUDA support, it’s not going to take off nearly as well.

2 Likes

I miss CUDA support too, because we have invested a lot in NVIDIA GPUs. But OpenCL support is also interesting, because people are starting to see NVIDIA’s monopoly as dangerous. Also, I want to tap my iMac Pro’s GPU :grinning:

2 Likes

There are two types of people who do “Data Science”: (1) data scientists who use algorithms and (2) those who create algorithms. OCaml isn’t great for the former but, in my highly biased opinion, fantastic for the latter.

On the creation aspect, I think there are three things that set OCaml apart.

  1. Thinking about your problem as a series of transformations (like a compiler) is extremely productive. In this respect, OCaml’s simple yet expressive abstraction mechanisms shine.
  2. Couple that with fast native compilation and good runtime performance, which let you iterate and diagnose quickly and so explore the problem domain.
  3. Finally, the deep-thinking yet pragmatic language developers continuously impress me. I am sometimes just dumbfounded that with every new version of OCaml (or adding flambda) my not-so-good code continues to speed up by 2-3%.
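
To make the first point concrete, here is a minimal sketch of an analysis written as a chain of transformations. The stage names (`parse`, `clean`, `mean`), the bounds and the sample data are all made up for illustration; only the OCaml standard library is assumed.

```ocaml
(* A tiny analysis written as a chain of transformations.
   All names and data here are illustrative. *)

(* Stage 1: parse raw strings, dropping anything that is not a number. *)
let parse (raw : string list) : float list =
  List.filter_map float_of_string_opt raw

(* Stage 2: keep only plausible measurements (hypothetical bounds). *)
let clean (xs : float list) : float list =
  List.filter (fun x -> x >= 0.0 && x <= 100.0) xs

(* Stage 3: reduce to a summary statistic; None for empty input. *)
let mean (xs : float list) : float option =
  match xs with
  | [] -> None
  | _ -> Some (List.fold_left ( +. ) 0.0 xs /. float_of_int (List.length xs))

(* The whole analysis is the composition of the stages. *)
let analyse raw = raw |> parse |> clean |> mean

let () =
  match analyse [ "10"; "n/a"; "20"; "250"; "30" ] with
  | Some m -> Printf.printf "mean = %.1f\n" m
  | None -> print_endline "no usable data"
```

Each stage has a plain type, so inserting, removing or reordering a stage either type-checks or fails at compile time, which is what makes quick iteration feel safe.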

One negative is that some execution models are not well supported by OCaml (SIMD, GPU), so your mileage might be limited.

On the usage perspective, yes, the libraries are limited, and chances are high that the one you need won’t be available. Having said that, it isn’t as if the application of ‘data-science’ has converged on any one language or technique; within this year I’ve used Python, R, Octave (no MATLAB, I should have just rewritten the code in OCaml) and command line tools (VW). The practical use of these tools requires understanding the methods and their inherent assumptions (e.g. normalizing data for PCA or some hypothesis testing), which are difficult to remember and annoying to diagnose when violated. I wish for a good typed representation of data and methods that would prevent me from making these mistakes.
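
One way such a typed representation could look is sketched below with phantom types: the type parameter records whether the data has been normalized, and a method that assumes normalized input refuses anything else at compile time. The `Dataset` module and all of its functions are hypothetical, and `variance` merely stands in for a method such as PCA; this is not an existing library API.

```ocaml
(* Phantom types: the type parameter records whether the data has been
   normalized. Module and function names are hypothetical. *)
module Dataset : sig
  type raw
  type normalized
  type 'state t

  val of_array : float array -> raw t
  val normalize : raw t -> normalized t

  (* Stands in for a method (PCA, a test, ...) that assumes normalized input. *)
  val variance : normalized t -> float
  val to_array : 'state t -> float array
end = struct
  type raw
  type normalized
  type 'state t = float array

  let of_array a = Array.copy a

  let mean a = Array.fold_left ( +. ) 0.0 a /. float_of_int (Array.length a)

  (* Center to zero mean and scale to unit variance. *)
  let normalize a =
    let m = mean a in
    let centered = Array.map (fun x -> x -. m) a in
    let sd = sqrt (mean (Array.map (fun x -> x *. x) centered)) in
    if sd = 0.0 then centered else Array.map (fun x -> x /. sd) centered

  let variance a =
    let m = mean a in
    mean (Array.map (fun x -> (x -. m) ** 2.0) a)

  let to_array a = Array.copy a
end

let () =
  let data = Dataset.of_array [| 2.0; 4.0; 6.0; 8.0 |] in
  let v = Dataset.variance (Dataset.normalize data) in
  Printf.printf "variance after normalization = %.3f\n" v
  (* Dataset.variance data   is rejected: raw t is not normalized t *)
```

At run time both states are the same `float array`; the distinction exists only in the types, so the check costs nothing.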

9 Likes

Even AMD is backing off of OpenCL now as far as I can tell, so the future isn’t bright. CUDA is pretty essential for any serious work nowadays.

1 Like

I have never used any of the packages to interface with Python.
Maybe I will someday if something I want to use is only available from the Python ecosystem.
Recently, I prefer to tap into the R ecosystem: it is very well established and has a lot of quality contributions.

The only reason I see for CUDA being more widely used than OpenCL is the market share of NVIDIA’s graphics cards. I don’t like monopolies; they are bad for customers in the long term.

1 Like

Hi. Sorry for the late reply. I’m using OCaml for data science at work.

What makes OCaml good/bad for data science (long story short of your experience with OCaml in this area)?

  • (Good) OCaml is fast.
  • (Good) Static typing prevents many small bugs. For example, Python often shows me errors such as a key not found in a dict after a long computation, whereas OCaml catches them at compile time (when we use records).
  • (Bad) There are fewer OCaml libraries for machine learning than Python ones.
  • (Bad) OCaml does not support multicore.
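
The record point can be illustrated with a small sketch. The field names, the BMI computation and the deliberate typo are invented for the example; the `Hashtbl` version plays the role of a Python dict.

```ocaml
(* Dictionary style: a misspelled key is only discovered when the lookup
   runs, possibly after a long computation. *)
let row_dict : (string, float) Hashtbl.t = Hashtbl.create 4
let () = Hashtbl.replace row_dict "height" 1.80
let () = Hashtbl.replace row_dict "weight" 75.0

let bmi_dict () =
  match Hashtbl.find_opt row_dict "wieght" (* typo compiles fine *) with
  | Some w -> Some (w /. (Hashtbl.find row_dict "height" ** 2.0))
  | None -> None

(* Record style: the same typo (r.wieght) would be a compile-time error. *)
type row = { height : float; weight : float }

let bmi (r : row) = r.weight /. (r.height ** 2.0)

let () =
  let r = { height = 1.80; weight = 75.0 } in
  Printf.printf "record: %.1f\n" (bmi r);
  match bmi_dict () with
  | Some b -> Printf.printf "dict: %.1f\n" b
  | None -> print_endline "dict: key not found at run time"
```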

What are the OCaml alternatives to Python’s Pandas, NumPy, SciPy, etc.?

As some people mentioned, owl is similar to numpy.

Do you know some frontier companies/products/projects that use OCaml for data science?

I don’t know. I use OCaml for data science personally. However, my colleagues use their favorite languages, e.g., Java, Python, etc.

Are there any problems related to data science that were solved by other platforms, but not by OCaml as a platform?

Lack of libraries, multi-core support and scalable distributed-memory processing environments (I know some opam packages such as rpc_parallel, but I cannot find enough examples).

Maybe you can give me a good piece of advice related to both OCaml and data science.

Jupyter (http://jupyter.org/) is very useful and it can execute OCaml code: https://akabe.github.io/ocaml-jupyter/. A Docker image containing many packages for data science is available: https://github.com/akabe/docker-ocaml-jupyter-datascience, and some examples are at https://github.com/akabe/docker-ocaml-jupyter-datascience/tree/master/notebooks.
Please try them, if you are interested.

6 Likes

Sorry, but unless you are on Windows, look for the following libraries in opam:
parmap, parany, ocamlnet, etc.
I run parallel OCaml programs every day.
Parmap is a good start.
There are even more such libraries that I did not mention.

4 Likes

Some people (like this person https://github.com/examachine) are using the OCaml MPI bindings in production.
If you really want to create scalable distributed applications, I would advise using the zmq OCaml bindings available in opam (zmq); it would force you to program in an agent-based style, à la Erlang.
I have done it once in the past, and it was pretty fun to write:
https://github.com/UnixJunkie/daft

3 Likes

I have used OCaml for machine learning in the past and the experience was very good but lack of multicore meant that no further work was pursued using OCaml. And yes, we used parmap, etc… but it just didn’t cut it for us, sorry.

It’s also worth noting that there is significantly higher demand for data scientists, i.e. people who know how to apply existing algorithms to real-world problems, than for developers who know how to implement such algorithms.

Data scientists are typically not so invested in programming languages and hence go for the easiest languages to learn. That explains a lot about why Python and R became so popular in that field despite being arguably technically inferior on many language dimensions.

I always find it amusing when people complain about the lack of multicore support as an explanation for why OCaml is not seeing the adoption it would otherwise deserve, especially in data science. Python and R also lack multicore support on a language level and are at least 2 orders of magnitude more popular in the field. But whenever I cross-validate OCaml numerical code with existing Python code, I typically find that pure Python solutions run 2 orders of magnitude more slowly. Even encountering in excess of 3 orders is no shocker to me. Try to beat that by just throwing more cores at the problem!

Python data science frameworks merely unload computationally demanding tasks to external libraries written in something fast and parallel. And that actually works just fine for most practical problems. This can obviously be done with OCaml, too, but that’s not enough.

The issue with OCaml is that its greatest strength is also its greatest weakness: a powerful static type system. Given a reasonable set of libraries, many if not most practical data science problems can be solved within a few hundred lines of code. This is usually too small to really notice the benefits of static typing. But a data scientist trying to learn OCaml would quickly get annoyed by a “nitpicking” type checker. They feel more productive (even if they aren’t) when being able to just run and test stuff.

If I had to give honest advice to someone who wants to build a career in the field, I’d absolutely not recommend OCaml for data science, and this has nothing to do with the language. The library / framework ecosystem is currently not at the same level though I consider it adequate for solving most problems. The bigger barriers are non-technical (psychological / social / business) and will make it rather unlikely that OCaml will see sufficient adoption in that field in the foreseeable future.

But if you can’t resist, your best bet would probably be niche applications that play to OCaml’s strengths. E.g. dealing with highly structured, nested, non-uniform data or model specifications is typically way more difficult in mainstream languages. Not everything in life is adequately captured by tensors and feedforward neural networks. In fact, I believe machine learning research as a field somewhat neglects highly structured problems for that very reason. If you find a killer application for OCaml in that area, you will likely not see much competition for a long time. But that would be a high risk endeavor.
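
As a rough sketch of what "highly structured, nested, non-uniform" can mean in practice, a model specification can be written as a recursive variant type and processed by pattern matching. The `feature` type, its constructors and the example spec are hypothetical.

```ocaml
(* A hypothetical model specification as a recursive variant type. *)
type feature =
  | Numeric of string
  | Categorical of string * string list   (* name, levels *)
  | Interaction of feature * feature
  | Group of string * feature list        (* named group of sub-features *)

(* Number of model columns a feature expands to, assuming one-hot
   encoding for categorical features. *)
let rec width = function
  | Numeric _ -> 1
  | Categorical (_, levels) -> List.length levels
  | Interaction (a, b) -> width a * width b
  | Group (_, fs) -> List.fold_left (fun acc f -> acc + width f) 0 fs

let spec =
  Group
    ( "patient",
      [ Numeric "age";
        Categorical ("sex", [ "f"; "m" ]);
        Interaction (Numeric "dose", Categorical ("drug", [ "a"; "b"; "c" ])) ] )

let () = Printf.printf "design matrix columns: %d\n" (width spec)
```

Adding a new kind of feature means adding a constructor, after which the compiler points at every function that still needs a case for it.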

15 Likes

This is 100% right. Types show their strength when you have a large body of code and you want to be resilient to changes and refactoring. Python seems like it ‘just runs’, even though it may crash at some point because nothing is checking your code.

Python also has the advantage of having very little added syntax. Programmers don’t like indent-based syntax, but for people coming from another discipline, not having let...in all over the place is a big benefit in comprehension.

I do, however, think that OCaml and good typing can benefit data science. It would be completely amazing if we could typecheck tensor dimensions properly, but even without that, the benefits of being able to rapidly and safely iterate are huge for experimentation, not to mention the fact that once you actually want to flesh out an ML algorithm and integrate it into a real system, the fragility of python is a killer, and you end up at the same place as the Reason guys who desperately want typing in their javascript apps.

3 Likes

I can’t argue, but: Static typing is still useful for small programs, and OCaml gives you that with minimal effort, especially if you don’t bother writing .mli files.

(Skipping .mli files seems to be slightly heretical. OCaml lets you choose, though. I love not having to annotate.)

2 Likes

Absolutely, I do agree that even small programs benefit from static typing.

The problem is cognitive myopia among beginners. Their initial lack of programming or language competence will cause them to run into typing errors too frequently, and they also may not have the ability yet to quickly understand what the compiler is telling them. Furthermore, testing and fixing bugs feels more productive to beginners than thinking carefully about a compiler message, which might guide them to their solution more efficiently. It’s maybe also an ego thing that people don’t like being told by machines that they suck at programming.

That’s why many give up too early and believe that static typing “gets in your way”, which is among the most frequent objections to OCaml I have heard. They don’t see themselves as becoming more proficient at using the type system. In their mind having to deal with these issues for small programs means programming with static types must be an even worse experience with big programs even though the opposite is the case. It’s hence of little surprise that languages that cater to their prejudices are more popular.

I think the level of competence where people really get sold on modern static type systems is when they start explicitly designing their programs to leverage the type checker, but this takes typically at least months of experience. I once saw a beginner switch from sum types to matching characters based on the reason that this way the compiler would complain less about their code! It’s sort of ironic that expert users do the exact opposite, i.e. write programs such that the compiler will scream at them as often as possible. I guess it’s not just an intellectual preference, but programming that way may require a certain level of emotional resilience.
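
The contrast in that anecdote can be sketched as follows; the operation names are invented for illustration. With characters the compiler has nothing to check, while with a sum type it knows the full set of cases.

```ocaml
(* Character encoding: the compiler cannot know which characters are valid,
   so a catch-all case is needed and mistakes surface only at run time. *)
let describe_char (c : char) : string =
  match c with
  | 'a' -> "add"
  | 's' -> "subtract"
  | _ -> "unknown"

(* Sum type: the set of cases is closed and known to the compiler. *)
type op = Add | Subtract | Multiply

let apply (o : op) (x : float) (y : float) : float =
  match o with
  | Add -> x +. y
  | Subtract -> x -. y
  | Multiply -> x *. y
  (* Deleting any case above triggers the non-exhaustive match warning. *)

let () = Printf.printf "%s %.1f\n" (describe_char 'm') (apply Multiply 3.0 4.0)
```

The character version "complains less" precisely because a forgotten case such as `'m'` silently falls into the catch-all branch.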

13 Likes

In my personal experience (coming from Python), the compiler messages weren’t that off-putting. Learning to deal with types was compensated by the ‘shiny new toy’ effect.
I will say that when trying something out in OCaml I usually need to spend some time first thinking about the right types. This is a preliminary design step that takes some experience. I now think that it actually helps me structure the problem, but it can feel slow to get started with something new compared to Python. I imagine people may find this off-putting.
The biggest problem continues to be availability of (bindings to) libraries.

1 Like

I guess the fact that you are posting about OCaml is proof of some degree of self-selection. A naturally high degree of exploratory behavior is surely helpful in overcoming initial obstacles. In my experience such curiosity is sadly in the minority.

Python and R have a richer library ecosystem, but that begs the question: why? It’s not like OCaml hasn’t been around for decades already. My best guess for an explanation is the initially steeper learning curve imposed by the type system that drives away most newcomers towards easier languages. Once they start implementing more and more libraries in those, the snowball effect takes over.

1 Like

I’m thinking you’re right. I hesitate to recommend OCaml to colleagues because I feel it’s not a good idea to start learning it if you need to get something done quickly. I still hope for the snowball effect going forward.

It would be great if OCaml could tap into the enormous expansion of Julia libraries. What would it take to call into Julia? They have a type system too; would this be a chance to get efficient cross-language communication?

1 Like

I don’t disagree with these points, but as someone who’s also a fan of Lisps, I think there are benefits to dynamically typed languages even when you get all of the types right–namely that you can mix types more freely when it makes your code simpler. There are dangers there, but there are tradeoffs to everything. OCaml is naturally attractive to people who weight the tradeoffs in one direction rather than the other; I’m not trying to convince anyone here to prefer Python or Clojure, etc., and OCaml is my current language of choice. I just want to point out that the reasons that people are attracted to dynamically typed languages are not only because it allows them to make type errors during initial programming. :slight_smile:

I think that’s especially true of exploratory code. When you want stability and resilience, the safety of types tends to dominate.

This is one thing I’ve realized recently: all of software engineering makes assumptions about the end product – specifically, that you’re striving for a mostly stable artifact. The artifact needs to evolve, but the majority of the code needs to be stable. This isn’t necessarily the case for data science and experimentation, and many aspects seen as positive in software engineering (such as DRY) can become liabilities. To some degree this is true about types as well.

3 Likes