I just ported the original C++ Enhanced Suffix Array to pure OCaml; you can find it here: GitHub - gborough/esa: Enhanced Suffix Array.
It’s the first time I have attempted to write low-allocation/no-allocation code in OCaml, and I must say it has been a great learning experience over the past few weeks. It makes me appreciate even more how OCaml offers low-level tuning on par with other low-level languages while staying functional at the same time.
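To give a flavour of what "low allocation" means here, below is a minimal sketch. It is not the code from the esa repository and not the algorithm the C++ original uses: it is a naive suffix array built with the standard library, and the names `compare_suffixes` and `suffix_array` are made up for illustration. The only point is that suffixes are compared by index, so no substrings are ever allocated, and the single `int array` is sorted in place.

```ocaml
(* Illustrative sketch only: a naive suffix array, NOT the esa algorithm.
   Function names are hypothetical. *)

(* Compare the suffixes of [s] (of length [n]) starting at [i] and [j],
   character by character, without building any substring. *)
let rec compare_suffixes (s : string) (n : int) (i : int) (j : int) : int =
  if i >= n then (if j >= n then 0 else -1) (* a proper prefix sorts first *)
  else if j >= n then 1
  else
    let c = Char.compare (String.get s i) (String.get s j) in
    if c <> 0 then c else compare_suffixes s n (i + 1) (j + 1)

(* Allocate one int array of start positions, then sort it in place. *)
let suffix_array (s : string) : int array =
  let n = String.length s in
  let sa = Array.init n (fun i -> i) in
  Array.sort (compare_suffixes s n) sa;
  sa
```

For example, `suffix_array "banana"` evaluates to `[|5; 3; 1; 0; 4; 2|]`. The worst-case cost of this naive version is far from what a real enhanced suffix array construction achieves, so treat it purely as an illustration of the allocation pattern.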
One of my personal goals (also our company's tech alignment) is to bring OCaml up to the same level of convenience as Python in some areas of AI/LLM. We are inspired by existing efforts in the OCaml community to take on this challenge, and our plan of attack will be more or less similar. Currently we are tackling the following problems:
- Porting Google SentencePiece (in progress): Enhanced Suffix Array done as a dependency; Double-Array Trie and a few other tokenizer utilities in progress.
- Porting Hugging Face Tokenizers (in progress): pending the completion of SentencePiece, though the less dependent code is already being converted.
The end product will probably contain a mixture of pure OCaml and a fair amount of FFI code. I dread to think what it is going to look like, since there will obviously be a ton of verbatim translations to OCaml, but I have little doubt about matching C++/Rust performance most of the time. We’ll also look into the upcoming OxCaml extensions to see if more performance can be eked out.
Hopefully we will have something to show the community in the near future.