A few days ago I posted about the T-Digest library. Today I’m back with another small algorithmic library:
You’re probably familiar with the Levenshtein distance: the minimum number of single character edits (additions, deletions, replacements) between 2 strings. Side note: this library can operate on more than strings.
Calculating the Levenshtein famously requires a lot of checks and comparisons.
Instead of calculating the distance, this library instead returns whether 2 values are within D distance of each other (
bool). There has been substantial development on the topic of Levenshtein automata in the last decade. See the “Fast String Correction with Levenshtein-Automata” paper by Klaus Schulz and Stoyan Mihov.
Using the graph construction technique from the paper, plus a few ideas from this article and several additional low level optimizations of my own, this library can answer the question (“are these 2 values within D edits of each other”) in the 1-10µs range, scaling linearly with the length of the values.
- a Functor is provided to enable comparisons across any arbitrary types
- string comparisons are provided (functorized) out of the box
- reuse the same automaton across all comparisons with the same
max_edits, regardless of the type of the values being compared
max_editsmust be between
3(inclusively) due to the astronomical scaling factor during graph construction
- most comparisons take under 5 µs, depending on the length of the values
It’s fast enough that it can be used instead of
String.(=) for some tasks and/or on large datasets.
All comments and feedback are welcome! I hope this library proves useful to the OCaml ecosystem as a whole. I’ll be back in a few days with a special algorithmic library to complete this little trilogy.