That is a very good question. I have some plans in mind, but they require quite a bit of work. The current situation is the following:

| Library | Syntax | Composition | Refill | Unicode | Automaton | Regexes |
|---|---|---|---|---|---|---|
| ocamllex | Custom | No¹ | Yes | No | DFA, codegen to C | Basic |
| sedlex | PPX | No¹ | Yes | Yes | DFA, codegen to OCaml | Limited |
| Re/Tyre | OCaml | Yes | No | No³ | in-memory NFA with online determinization² | Extended⁴ |
| ppx_regexp/tyre | PPX+OCaml | Yes | No | No | - | Extended⁴ |

¹: Some built-in mechanism for locally defining regex exists, but no true composition.
²: There are some things to determinize offline, but they need refreshing
³: https://github.com/ocaml/ocaml-re/pull/48
⁴: Within regularity. Lacks full-blown complementation. See also this paper.
My plan would not be to improve sedlex, but rather to push re (and the related libraries) to the point where it’s universally better. Ppx_regexp/tyre provide a convenient ocamllex-like syntax while preserving composition. The first steps would be a refill mechanism and support for UTF (for which @nojb made a prototype that would need to be revived).
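
To make "composition" concrete, here is a minimal sketch using the combinators of the `re` library: regexes are ordinary OCaml values, so larger ones are built from smaller ones. The names `digits`, `sign`, `integer`, `decimal`, `number` and `is_number` are made up for the illustration, and the snippet assumes `re` is installed (for example via opam).

```ocaml
(* Requires the `re` library. All names below are illustrative. *)

(* Small building blocks, defined as plain OCaml values. *)
let digits = Re.rep1 Re.digit
let sign = Re.opt (Re.set "+-")

(* Composition: larger regexes reuse the smaller ones. *)
let integer = Re.seq [ sign; digits ]
let decimal = Re.seq [ integer; Re.char '.'; digits ]
let number = Re.alt [ decimal; integer ]

(* Compile once, anchored so that the whole input must match. *)
let number_re = Re.compile (Re.whole_string number)

let is_number s = Re.execp number_re s
```

With ocamllex or sedlex, `integer` and `decimal` could only be shared through the local regex definitions mentioned in footnote ¹; here they can live in any module and be passed to functions like any other value.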
Performance is a tricky question. ocamllex is clearly faster, since it generates a C-based DFA. I expect sedlex to be faster than re on small examples, but that would need to be measured. Online determinization is very desirable in many contexts.
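
For readers unfamiliar with the term, here is a rough sketch of what online determinization means: DFA states (sets of NFA states) and their transitions are computed only when the input actually reaches them, then cached. This is a toy model written for this post using only the standard library, not how `re` is implemented internally; all type and function names are made up.

```ocaml
(* A tiny NFA: states are ints; [delta q c] lists the successors of q on c. *)
type nfa = {
  start : int list;
  delta : int -> char -> int list;
  accept : int -> bool;
}

(* A DFA state is a sorted, duplicate-free list of NFA states. *)
type dstate = int list

type lazy_dfa = {
  nfa : nfa;
  cache : (dstate * char, dstate) Hashtbl.t; (* DFA transitions built so far *)
}

let make nfa = { nfa; cache = Hashtbl.create 16 }

(* Look up a DFA transition, computing and caching it on first use. *)
let step d (s : dstate) (c : char) : dstate =
  match Hashtbl.find_opt d.cache (s, c) with
  | Some s' -> s'
  | None ->
    let s' =
      List.sort_uniq compare (List.concat_map (fun q -> d.nfa.delta q c) s)
    in
    Hashtbl.add d.cache (s, c) s';
    s'

(* Run the input through the lazily built DFA. *)
let matches d input =
  let state = ref (List.sort_uniq compare d.nfa.start) in
  String.iter (fun c -> state := step d !state c) input;
  List.exists d.nfa.accept !state

(* Example NFA: strings over {a, b} whose second-to-last character is 'a'. *)
let second_to_last_a =
  make
    {
      start = [ 0 ];
      delta =
        (fun q c ->
          match q, c with
          | 0, 'a' -> [ 0; 1 ]
          | 0, 'b' -> [ 0 ]
          | 1, ('a' | 'b') -> [ 2 ]
          | _ -> []);
      accept = (fun q -> q = 2);
    }
```

Only the transitions that the inputs exercise end up in `cache`, which is why this approach avoids the up-front state explosion of full offline determinization, at the cost of some work during matching.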
Regarding Unicode libraries: sedlex, at least, was designed so that bunzli’s libraries can be applied to the stream before handing it to sedlex, either to re-encode or to normalize. I think that’s a decent way of doing things.