[ANN] fuzzy_compare

A few days ago I posted about the T-Digest library. Today I’m back with another small algorithmic library:

Github link

You’re probably familiar with the Levenshtein distance: the minimum number of single character edits (additions, deletions, replacements) between 2 strings. Side note: this library can operate on more than strings.

Calculating the Levenshtein famously requires a lot of checks and comparisons.

Instead of calculating the distance, this library instead returns whether 2 values are within D distance of each other (bool). There has been substantial development on the topic of Levenshtein automata in the last decade. See the “Fast String Correction with Levenshtein-Automata” paper by Klaus Schulz and Stoyan Mihov.

Using the graph construction technique from the paper, plus a few ideas from this article and several additional low level optimizations of my own, this library can answer the question (“are these 2 values within D edits of each other”) in the 1-10µs range, scaling linearly with the length of the values.

  • a Functor is provided to enable comparisons across any arbitrary types
  • string comparisons are provided (functorized) out of the box
  • reuse the same automaton across all comparisons with the same max_edits, regardless of the type of the values being compared
  • max_edits must be between 0 and 3 (inclusively) due to the astronomical scaling factor during graph construction
  • most comparisons take under 5 µs, depending on the length of the values

It’s fast enough that it can be used instead of String.(=) for some tasks and/or on large datasets.

All comments and feedback are welcome! I hope this library proves useful to the OCaml ecosystem as a whole. I’ll be back in a few days with a special algorithmic library to complete this little trilogy.

15 Likes