[ANN] mula 0.1.0, ML's radishal Universal Levenshtein Automata library

ifazk · June 20, 2021, 8:18am

Hi all,

I’m happy to announce the release of my library mula. The package uses Universal Levenshtein Automata (ULA) to not only check if a word is within a certain edit distance of another, but to also output what the edit distance is! It uses the automata themselves to calculate edit distances. A fun use case for this is that we can feed a set of words to the automaton and immediately rank the words by their edit distance.

Mula supports both the standard Levenshtein edit distance and the Damerau-Levenshtein distance, which counts a transposition of two adjacent characters as a single edit. I also support getting live error counts, so you can feed part of a string into an automaton and get the minimum number of errors that have occurred so far.
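To make the bounded distance, the live error count, and the ranking use case concrete, here is a minimal reference sketch. It uses plain row-by-row dynamic programming rather than mula's automata, it covers only the standard Levenshtein distance, and every name in it (`start`, `feed`, `current_error`, `end_input`, `rank`) is hypothetical, not mula's API. The live error count is modelled as the minimum of the current DP row, which I am assuming matches what the automaton reports.

```ocaml
(* Reference sketch only: plain dynamic programming, not mula's automata.
   All names here are hypothetical and are not taken from the library. *)

(* row.(j) is the edit distance between the input fed so far and the
   first j characters of the base word. *)
type state = { word : string; k : int; row : int array }

let start ~k word =
  { word; k; row = Array.init (String.length word + 1) (fun j -> j) }

(* Feed one input character, producing the next DP row. *)
let feed st c =
  let n = String.length st.word in
  let next = Array.make (n + 1) (st.row.(0) + 1) in
  for j = 1 to n do
    let subst = st.row.(j - 1) + (if st.word.[j - 1] = c then 0 else 1) in
    next.(j) <- min subst (min (st.row.(j) + 1) (next.(j - 1) + 1))
  done;
  { st with row = next }

let feed_string st s = Seq.fold_left feed st (String.to_seq s)

(* Distances above the bound k are reported as None. *)
let bounded k d = if d <= k then Some d else None

(* Live error count: no continuation of the input can do better than this. *)
let current_error st = bounded st.k (Array.fold_left min max_int st.row)

(* Final answer once the whole input has been fed. *)
let end_input st = bounded st.k st.row.(String.length st.word)

let distance ~k word input = end_input (feed_string (start ~k word) input)

(* Rank candidates by distance to [word], dropping those farther than k. *)
let rank ~k word candidates =
  candidates
  |> List.filter_map (fun c ->
         Option.map (fun d -> (d, c)) (distance ~k word c))
  |> List.sort compare
```

For example, `rank ~k:2 "hello" ["help"; "hello"; "world"; "hallo"]` keeps `"hello"`, `"hallo"`, and `"help"` in that order and drops `"world"`.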

I currently have matching working using non-deterministic ULA, and I have started work on the deterministic versions. It should be possible to pre-compute the DFAs for edit distances up to 3 and pack them with the library. They would never need to be recomputed, because the Universal Automata are independent of the input strings. But the non-deterministic automata support very large edit distances, up to (Sys.int_size - 1)/2, so they have value on their own.
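As a quick worked example of that bound, the snippet below evaluates it for whatever platform it runs on. The interpretation in the comment (that the limit comes from fitting a bitvector of 2k + 1 bits into one native `int`) is my assumption and is not stated above; `max_k` is just an illustrative name.

```ocaml
(* Assumption: the per-character bitvector spans 2k + 1 bits and is stored
   in a single native OCaml int, which would explain the bound. *)
let max_k = (Sys.int_size - 1) / 2

let () =
  Printf.printf "int_size = %d, max k = %d, bits needed = %d\n"
    Sys.int_size max_k (2 * max_k + 1)
```

On a typical 64-bit native build `Sys.int_size` is 63, which gives a maximum edit distance of 31.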

This library came about from a desire to add a “did you mean” feature to a toy compiler, but not wanting to write the kind of dynamic programming code that you can find in the OCaml compiler [1] or merlin/spelll [2,3].

You can find the library here and the documentation here.
~~It’s not on opam yet, but I have submitted a pull request.~~
Update: It’s on opam now.

Happy OCamling!

References:

[1] Edit distance in the OCaml compiler. ocaml/misc.ml at e5e9c5fed56efdd67601e4dbbaebeb134aee361c · ocaml/ocaml · GitHub
[2] Edit distance in merlin. merlin/misc.ml at 444f6e000f6b7dc58dac44d6ac096fc0e09894cc · ocaml/merlin · GitHub
[3] Edit distance in spelll. spelll/Spelll.ml at 3da1182256ff2507a0be812f945a7fe1a19adf9b · c-cube/spelll · GitHub

ifazk · June 20, 2021, 8:23am

Some details:

I followed the paper by Touzet [1] as much as possible. If you take a look at the code, you’ll see a lot of +1’s for 1-indexing. This was to keep the implementation as close to the paper as possible! (If you do want to check the implementation against the paper, note that the paper has a typo in Definition 2.) For the Damerau-Levenshtein automaton, I adapted Figure 9 from Mitankin’s thesis [2]. I’m convinced that my adaptation works, but my adaptation of Touzet’s subsumption relation for Damerau-Levenshtein might be slightly sub-optimal. If you have questions about the adaptation, feel free to ask!

mula does not completely replace c-cube’s spelll package. In particular, I don’t support any indexes, etc. But there are some interesting differences in the automata they use. (w stands for the base word here.)

- The spelll package creates the Levenshtein Automaton for a single string/word (LA_w); mula uses Universal Levenshtein Automata (ULA).
- Spelll computes a DFA from a non-deterministic automaton that uses epsilon transitions. ULA do not have epsilon transitions; instead, their transitions look ahead into the base word w. Additionally, the NFA’s states and transitions are computable on the fly, so there is no need to store the NFA in memory.
- Spelll’s automata transition on characters. mula computes a bitvector from each input character and uses it to transition from states to states. (Computing the bitvector is where the lookahead comes in.)
- Spelll’s automata return true/false, and it uses a separate function to calculate edit distances. Mula uses the automaton itself to calculate edit distances, so the outputs have type int option. (LA_w can be modified to support this though!)
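To illustrate the bitvector step, here is a small sketch of how such a vector could be computed from an input character by looking into the base word. The function name, the 0-based indexing, the window of 2k + 1 positions centred on the input position, and the bit order are all assumptions made for this example; mula's code and Touzet's paper may lay the window out differently.

```ocaml
(* Hypothetical helper, not mula's API. Bit j of the result is set when the
   base word has character [c] at 0-based position [i - k + j], for
   j = 0 .. 2k. Positions outside the word contribute 0 bits. *)
let bitvector ~k ~word ~i c =
  let n = String.length word in
  let bits = ref 0 in
  for j = 0 to 2 * k do
    let pos = i - k + j in
    if pos >= 0 && pos < n && word.[pos] = c then
      bits := !bits lor (1 lsl j)
  done;
  !bits
```

With `k = 1`, feeding `'a'` at position 1 of `"abab"` inspects positions 0 to 2 and yields `0b101`. Feeding `'c'` at position 1 of `"cdcd"` yields the same vector, which is the sense in which the automaton is universal: it only ever sees the bits, never the characters of w.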

References:

[1] On the Levenshtein Automaton and the Size of the Neighborhood of a Word. Hélène Touzet. https://hal.archives-ouvertes.fr/hal-01360482/file/LATA2016.pdf
[2] Universal Levenshtein Automata: Building and Properties. Petar Nikolaev Mitankin. https://store.fmi.uni-sofia.bg/fmi/logic/theses/mitankin-en.pdf

Regis_Smith · June 22, 2021, 7:51pm

The examples given on the main GitHub page are very useful. Thanks!

ifazk · June 22, 2021, 8:14pm

Thanks. I just fixed some typos there and added some examples of using the provided functor!
