[ANN] mula 0.1.0, ML's radishal Universal Levenshtein Automata library

Hi all,

I’m happy to announce the release of my library mula. The package uses Universal Levenshtein Automata (ULA) to not only check if a word is within a certain edit distance of another, but to also output what the edit distance is! It uses the automata themselves to calculate edit distances. A fun use case for this is that we can feed a set of words to the automaton and immediately rank the words by their edit distance.

Mula supports both the standard Levenshtein edit distance as well as the Demarau-Levenshtein distance which counts transpositions of two adjacent characters as a single edit. I also support getting live error counts, so you can feed part of a string into an automaton, and get the minimum number of errors that have occurred so far.

I currently have matching working using non-deterministic ULA, but I have partially started the work toward the deterministic versions. It should be possible to pre-compute the DFAs for up to edit distance 3 and pack it with the library, never needing to be recomputed because the Universal Automata are independent of the input strings. But the non-deterministic automata support very large edit distances: (Sys.int_size - 1)/2, so they have value on their own.

This library came about from a desire to add a “did you mean” feature to a toy compiler, but not wanting to write the kind of dynamic programming code that you can find in the OCaml compiler [1] or merlin/spelll [2,3].

You can find the library here and the documentation here.
It’s not on opam yet, but I have submitted a pull request.
Update: It’s on opam now.

Happy OCamling!

References:

  1. Edit distance in the OCaml compiler. ocaml/misc.ml at e5e9c5fed56efdd67601e4dbbaebeb134aee361c · ocaml/ocaml · GitHub.
  2. Edit distance in merlin. merlin/misc.ml at 444f6e000f6b7dc58dac44d6ac096fc0e09894cc · ocaml/merlin · GitHub
  3. Edit distance in spelll. spelll/Spelll.ml at 3da1182256ff2507a0be812f945a7fe1a19adf9b · c-cube/spelll · GitHub
9 Likes

Some details:

I followed the paper by Touzet [1] as much as possible. If you take a look at the code, you’ll see a a lot of +1’s for 1-indexing. This was to keep the implementation as close to the paper as possible! (If you do want to check the implementation against the paper, note that the paper has a typo in Definition 2). For the Demarau-Levenshtein automaton, I adapted Figure 9 from Mitankin’s thesis [2]. I’m convinced that my adaptation works, but my adaptation of Touzet’s subsumption relation for Demarau-Levenshtein might be slightly sub-optimal. If you have question about the adaptation, feel free to ask!

mula does not completely replace c-cube’s spelll package. In particular I don’t support any indexs, etc. But there are some interesting differences in the automata they use. (w stands for the base word here)

  1. The spelll package creates the Levenshtein Automaton for a single string/word (LA_w), mula uses Universal Levenshtein Automata (ULA).
  2. Spelll computes a DFA from a non-deterministic automaton that uses eplison transitions. ULA do not have epsilon transitions, but for transitions it looks ahead into the base word w. Additionally the NFA’s states/transitions are computable on the fly, so there is no need to store the NFA in memory.
  3. Spelll's automata transitions using characters. mula computes a bitvector from an input character to transition from states to states. (Computing the bitvector is where the look ahead comes in).
  4. Spelll's automata return true/false, and uses a separate function to calculate edit distances. Mula uses the automaton itself to calculate edit distances, the outputs have type int option. (LA_w can be modified to support this though!)

References:

  1. On the Levenshtein Automaton and the Size of the Neighborhood of a Word. Hélène Touzet https://hal.archives-ouvertes.fr/hal-01360482/file/LATA2016.pdf
  2. Universal Levenstein Automata: Building and Properties. Petar Nikolaev Mitankin. https://store.fmi.uni-sofia.bg/fmi/logic/theses/mitankin-en.pdf
2 Likes

The examples given on the main Github page are very useful. Thanks!

Thanks. I just fixed some typos there and added some examples of using the provided functor!

1 Like