I’m happy to announce the release of my library mula. The package uses Universal Levenshtein Automata (ULA) to not only check if a word is within a certain edit distance of another, but to also output what the edit distance is! It uses the automata themselves to calculate edit distances. A fun use case for this is that we can feed a set of words to the automaton and immediately rank the words by their edit distance.
Mula supports both the standard Levenshtein edit distance as well as the Demarau-Levenshtein distance which counts transpositions of two adjacent characters as a single edit. I also support getting live error counts, so you can feed part of a string into an automaton, and get the minimum number of errors that have occurred so far.
I currently have matching working using non-deterministic ULA, but I have partially started the work toward the deterministic versions. It should be possible to pre-compute the DFAs for up to edit distance 3 and pack it with the library, never needing to be recomputed because the Universal Automata are independent of the input strings. But the non-deterministic automata support very large edit distances: (Sys.int_size - 1)/2, so they have value on their own.
This library came about from a desire to add a “did you mean” feature to a toy compiler, but not wanting to write the kind of dynamic programming code that you can find in the OCaml compiler [1] or merlin/spelll [2,3].
You can find the library here and the documentation here. It’s not on opam yet, but I have submitted a pull request. Update: It’s on opam now.
I followed the paper by Touzet [1] as much as possible. If you take a look at the code, you’ll see a a lot of +1’s for 1-indexing. This was to keep the implementation as close to the paper as possible! (If you do want to check the implementation against the paper, note that the paper has a typo in Definition 2). For the Demarau-Levenshtein automaton, I adapted Figure 9 from Mitankin’s thesis [2]. I’m convinced that my adaptation works, but my adaptation of Touzet’s subsumption relation for Demarau-Levenshtein might be slightly sub-optimal. If you have question about the adaptation, feel free to ask!
mula does not completely replace c-cube’s spelll package. In particular I don’t support any indexs, etc. But there are some interesting differences in the automata they use. (w stands for the base word here)
The spelll package creates the Levenshtein Automaton for a single string/word (LA_w), mula uses Universal Levenshtein Automata (ULA).
Spelll computes a DFA from a non-deterministic automaton that uses eplison transitions. ULA do not have epsilon transitions, but for transitions it looks ahead into the base word w. Additionally the NFA’s states/transitions are computable on the fly, so there is no need to store the NFA in memory.
Spelll's automata transitions using characters. mula computes a bitvector from an input character to transition from states to states. (Computing the bitvector is where the look ahead comes in).
Spelll's automata return true/false, and uses a separate function to calculate edit distances. Mula uses the automaton itself to calculate edit distances, the outputs have type int option. (LA_w can be modified to support this though!)