Sedlex moved to ocaml-community

First off, I’m pleased to see sedlex getting some love finally. Very grateful to the community for that.

I would like to add here that my forthcoming Orsetto project includes another alternative to sedlex that might be worth noting, although it has issues and it remains in the “unstable” branch while I’m slowly working on it in my copious spare time.

I would describe it here as follows:

| Library | Syntax | Composition | Refill | Unicode | Automaton | Regexs |
|---|---|---|---|---|---|---|
| Orsetto.UCS | OCaml | Yes | Yes | Yes | Lazy DFA | Basic¹ |

¹: A subset of UTS #18, RL1 (no loose matching, word or line boundaries, etc.)

Also, I’m not sure what “refill” means here, so I didn’t characterize it.

As for the question above about how to keep the lazy DFA in a Unicode regular expression engine from consuming all the memory in the world, here is a word about the approach I took. I used discrete interval sets and maps for both the Unicode property database and the DFA state nodes. These are implemented as static inverse-multiplicative binary search trees, in the hope of avoiding death by cache-miss overload. I took @dbuenzli’s excellent Uucf module and invented my own UTF-8 codec and Unicode normalization functions using the same core data structures in the Orsetto CF framework.
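
To make the interval-map idea concrete, here is a minimal sketch. It is not Orsetto's actual code or API: the names `span`, `interval_map`, `find` and `blocks` are hypothetical, and it uses an ordinary binary search over a sorted array rather than the tree layout described above. The point is only that a handful of code point ranges can stand in for a table with one entry per code point.

```ocaml
(* Hypothetical sketch; none of these names come from Orsetto. *)

(* A closed range of code points [lo, hi] that all map to one value. *)
type 'a span = { lo : int; hi : int; value : 'a }

(* Assumed invariant: spans are sorted by [lo] and do not overlap. *)
type 'a interval_map = 'a span array

(* Plain binary search for the span containing code point [cp]. *)
let find (m : 'a interval_map) (cp : int) : 'a option =
  let rec go left right =
    if left > right then None
    else
      let mid = left + (right - left) / 2 in
      let s = m.(mid) in
      if cp < s.lo then go left (mid - 1)
      else if cp > s.hi then go (mid + 1) right
      else Some s.value
  in
  go 0 (Array.length m - 1)

(* A toy property table: three Unicode blocks, three array entries. *)
type block = Basic_latin | Greek_and_coptic | Cyrillic

let blocks : block interval_map = [|
  { lo = 0x0000; hi = 0x007F; value = Basic_latin };
  { lo = 0x0370; hi = 0x03FF; value = Greek_and_coptic };
  { lo = 0x0400; hi = 0x04FF; value = Cyrillic };
|]

let () =
  (* U+03B1 GREEK SMALL LETTER ALPHA falls in the second span. *)
  match find blocks 0x3B1 with
  | Some Greek_and_coptic -> print_endline "Greek and Coptic"
  | Some _ -> print_endline "other block"
  | None -> print_endline "not in table"
```

The same shape would presumably work for DFA transitions, with each span's value being a target state instead of a property, so a state's outgoing edges cost memory in proportion to the number of distinct ranges rather than the size of the code point space.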