[ANN] Esa 0.1.0 - Enhanced Suffix Array (and further plans)

gborough · July 11, 2025, 12:29am

I just ported the original C++ Enhanced Suffix Array implementation to pure OCaml. You can find it here: GitHub - gborough/esa: Enhanced Suffix Array.
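For anyone who has not met the data structure before, here is a minimal sketch of what an enhanced suffix array holds: the sorted suffix start positions plus the longest-common-prefix (LCP) table between neighbouring suffixes. This is a naive illustration only. The function names `compare_suffixes`, `suffix_array` and `lcp_array` are made up for this post and are not the API of esa, and the sort-based construction is far slower than what a real implementation would use.

```ocaml
(* Naive illustration only: not the algorithm or the API of the esa library. *)

(* Compare the suffixes of [s] starting at [i] and [j] character by character,
   without allocating any substrings. A shorter suffix that is a prefix of the
   other sorts first. *)
let compare_suffixes s i j =
  let n = String.length s in
  let rec go i j =
    if i >= n && j >= n then 0
    else if i >= n then -1
    else if j >= n then 1
    else
      let c = Char.compare s.[i] s.[j] in
      if c <> 0 then c else go (i + 1) (j + 1)
  in
  go i j

(* Suffix array: start positions of all suffixes, in lexicographic order. *)
let suffix_array s =
  let sa = Array.init (String.length s) (fun i -> i) in
  Array.sort (compare_suffixes s) sa;
  sa

(* LCP table: lcp.(k) is the length of the longest common prefix of the
   suffixes at sa.(k - 1) and sa.(k); lcp.(0) is 0 by convention here. *)
let lcp_array s sa =
  let n = String.length s in
  let common i j =
    let rec go k =
      if i + k < n && j + k < n && s.[i + k] = s.[j + k] then go (k + 1) else k
    in
    go 0
  in
  Array.mapi (fun k i -> if k = 0 then 0 else common sa.(k - 1) i) sa
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so `suffix_array "banana"` gives `[|5; 3; 1; 0; 4; 2|]` and the matching LCP table is `[|0; 1; 3; 0; 0; 2|]`.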

This is the first time I have attempted to write low-allocation/no-allocation code in OCaml, and I must say it has been a great learning experience over the past few weeks. It makes me appreciate more how OCaml can offer low-level tuning on par with other low-level languages whilst staying functional.

One of my personal goals (and also part of our company’s tech alignment) is to bring OCaml up to the same level of convenience as Python in some areas of AI/LLM. We are inspired by existing efforts in the OCaml community to take on this challenge, and our plan of attack will be more or less similar. Currently we are tackling the following problems:

Porting Google SentencePiece (in progress): Enhanced Suffix Array is done as a dependency; Double-Array Trie and a few other tokenizer utilities are in progress.
Porting Hugging Face Tokenizers (in progress): pending the completion of the SentencePiece port, though the less dependent code is already being converted.

The end product will probably contain a mixture of pure OCaml and a fair amount of FFI code. I dread to think what it is going to look like, since there will obviously be a ton of verbatim translation to OCaml, but I have little doubt about matching C++/Rust performance most of the time. We’ll also look into the upcoming OxCaml extension to see if more performance can be eked out.

Hopefully we will have something to show the community in the near future.
