Hi everyone, I’m new here and I’m trying to learn OCaml in my free time.
Since it’s easy to start learning something and then forget it because you never use it for real, I was thinking that I would need some small wins, so I’d like to try using it to my advantage at work.
I sometimes have to process big text files line by line, and in many cases this is done in Python. The processing can take many hours, or even days in the most extreme cases. Do you think that OCaml will offer visible speedups compared to Python in this case? Of course the same could be done in C++ or, a bit better, in Go, but I’d rather do it in a fun and concise language.
In my experience OCaml is great for ad-hoc data processing (reading CSVs, parsing, grouping, summarizing data, etc). I’ve built multiple production systems for event processing and analytics simply relying on immutable data structures provided by the standard library. I also wrote a small library to help with stream processing since the built-in models aren’t optimized for certain use cases (see streaming). If you need more advanced analysis, you may want to take a look at raven.
Edit: I forgot to mention that yes, OCaml can be super fast if you ensure that your algorithms and memory usage are optimal.
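To give a rough idea of what this kind of line-by-line processing can look like, here is a minimal sketch that uses only the standard library. It assumes OCaml 4.14 or later (for the In_channel module), and the task itself, counting how often each value appears in the first tab-separated column, is a made-up example; the names fold_lines and count_first_column are mine, not from any library.

```ocaml
(* Fold a function over the lines of a channel, one line at a time,
   so the whole file is never held in memory.
   Assumes OCaml >= 4.14 for the In_channel module. *)
let fold_lines f init ic =
  let rec loop acc =
    match In_channel.input_line ic with
    | None -> acc
    | Some line -> loop (f acc line)
  in
  loop init

(* Hypothetical task: count how often each value appears in the
   first tab-separated column of the input. *)
let count_first_column ic =
  let counts = Hashtbl.create 1024 in
  fold_lines
    (fun () line ->
      match String.split_on_char '\t' line with
      | [] -> ()
      | key :: _ ->
        let n = Option.value (Hashtbl.find_opt counts key) ~default:0 in
        Hashtbl.replace counts key (n + 1))
    () ic;
  counts

(* Usage: pass the input file as the first command-line argument. *)
let () =
  if Array.length Sys.argv > 1 then
    In_channel.with_open_text Sys.argv.(1) (fun ic ->
      count_first_column ic
      |> Hashtbl.iter (fun key n -> Printf.printf "%s\t%d\n" key n))
```

The loop is tail-recursive and reads one line at a time, so memory use depends on the number of distinct keys rather than on the size of the file.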
I used OCaml a lot for bioinformatics work, which involved a lot of this kind of crunching through large files. Sometimes I would compare the OCaml program to one written in Python or Ruby. Without really trying to make the OCaml version fast, i.e., just writing it in a natural way, it would often be 2x-5x faster (and often much faster than that if you care to optimize the OCaml). The more involved the data crunching “script” was, the greater the speedup over Python/Ruby generally was (similar to what Frederic_Loyer mentioned). For many of these kinds of tasks, OCaml was fast enough for my use case that I wouldn’t bother taking the time to write the program in something faster like C, C++, or Rust.
The other nice thing about writing this type of stuff in OCaml vs Python/Ruby is that even when I thought it would be some sort of throwaway script, it would inevitably keep growing, or I would need to come back to it much later and modify it, and that is much nicer to do in OCaml than in the alternatives.
I would also suggest you check out rizo’s streaming library, as well as c-cube’s iter library, as they are very nice for these sorts of data processing pipeline programs.
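For a flavour of the pipeline style, here is a small sketch that uses only the standard library's Seq module (OCaml 4.14 or later) as a stand-in. It is not the API of either library mentioned above, and the pipeline itself (skip blank lines and '#' comments, then sum the line lengths) is just an illustrative assumption, as are the names lines_of_channel and total_length.

```ocaml
(* Lazily yield the lines of a channel; nothing is read until the
   sequence is consumed. The sequence is ephemeral: traverse it once.
   Assumes OCaml >= 4.14 for Seq.of_dispenser and In_channel. *)
let lines_of_channel ic : string Seq.t =
  Seq.of_dispenser (fun () -> In_channel.input_line ic)

(* Hypothetical pipeline: drop blank lines and lines starting with '#',
   then sum the lengths of the remaining lines. *)
let total_length (lines : string Seq.t) : int =
  lines
  |> Seq.filter (fun l -> l <> "" && l.[0] <> '#')
  |> Seq.map String.length
  |> Seq.fold_left ( + ) 0

(* Usage: pass the input file as the first command-line argument. *)
let () =
  if Array.length Sys.argv > 1 then
    In_channel.with_open_text Sys.argv.(1) (fun ic ->
      Printf.printf "%d\n" (total_length (lines_of_channel ic)))
```

Each stage pulls one line at a time from the previous one, so the file is streamed rather than loaded whole.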