Sedlex moved to ocaml-community

For those of you who use sedlex, it has moved to a new repository on ocaml-community.

This was done with a github transfer, so existing issues against the repository got moved as well, and existing users should be redirected properly. That said, I’ve done a pull request to fix the opam packages and hopefully it will be merged soon.

4 Likes

Thanks!

I think sedlex is a very nice project, bound to be used more and more as the languages we want to parse/implement start using more unicode syntax. Being a ppx makes it lightweight to deploy within a project (integrate in the build system, etc.).

Some remarks:

  • The “contact info” at the start of the README should maybe be changed.

  • It would be nice to point to the examples from the README, and point out that they come with an example of dune setup – I would guess that many users would like to reuse that.

  • I would be mildly curious to see some data on the performance of sedlex compared to ocamllex. Differences could come from the implementation approach and/or from unicode support. On the other hand, to be interpreted properly this information needs to come with the weight of lexing in typical processing pipelines (for example, in a compiler), and I am not actually sure what this cost is – the fact that lexing is typically done lazily during parsing makes a bit more work to measure properly.

  • There is some overlap with existing unicode libraries and existing regexps libraries.

    Could for example Uutf be used at runtime, and Uucp to determine character properties?

    How does the regexp compilation engine compares to what is done in Re? Would it be possible to build a lexer generator on top of an independent regexp library like Tyre or ppx_regexp? (cc @Drup)

1 Like

I also noticed this bit in the README:

  • The OCaml source is assumed to be encoded in Latin1 (for string and character literals).

I think this could lead to surprises as most people today use UTF8 for their source files.

That is a very good question, I have some plans in mind, but they require quite a bit of work. Basically, the current situation is the following:

Library Syntax Composition Refill Unicode Automaton Regexs
ocamlex Custom No¹ Yes No DFA, codegen to C Basic
sedlex PPX No¹ Yes Yes DFA, codegen to OCaml Limited
Re/Tyre OCaml Yes No No³ in-memory NFA with online determinization² Extended⁴
ppx_regex/tyre PPX+OCaml Yes No No - Extended⁴

¹: Some built-in mechanism for locally defining regex exists, but no true composition.
²: There are some things to determinize offline, but they need refreshing
³: [RFC] Add Unicode support by nojb · Pull Request #48 · ocaml/ocaml-re · GitHub
⁴: Within regularity. Lacks full blown complementation. See also this paper.

My plans would not be to try to improve sedlex, but rather to push re (and the related libraries) to the point where it’s universally better. Ppx_regexp/tyre provides a convenient “ocamlex” like syntax while preserving composition. The first step would be a refill mechanism, and support for UTF (for which @nojb made a prototype that would need revival).

Performances are a tricky question. Ocamllex is clearly faster, since it generates a C-based DFA. I expect sedlex to be faster than re in small examples, but it would need evaluation. Online determinization is very desirable in many contexts.

wrt. Unicode libraries: At least for sedlex, it was designed so that bunzli’s libraries can be used before giving the stream to sedlex. Either to re-encode, or to normalize. I think that’s a decen way of doing things.

5 Likes

One other thing that sedlex misses wrt ocamllex are “tags” (the as operator).

Yes, sedlex’ regular expressions are quite limited and don’t support submatching or any extension. ocamlex as submatching. Re also add several additional operators, but no generalized complementation/difference. Dreml (the paper listed above) has both submatching and complementation but doesn’t determinize.

The state of the art in regular expression engines is surprisingly incomplete. Nobody really figured out a proper way to have (longest) submatch, complementation and online determinization in a way that doesn’t explode with unicode char classes. To interested people in the room … :wink:

1 Like

Just a few notes:

  1. The ocaml-community repository really is for the community’s benefit. If you see problems in the sedlex README or elsewhere, please submit a pull request, or, if you don’t have time for that, open an issue. We will get to it as quickly as we possibly can.

  2. I think that feature parity with ocamllex is necessary going forward. sedlex should support the full range of regexps that can be handled in efficient time (so things like ^ and $ yes, and not things like backrefs), submatch capture, “normal” regexp syntax (via the {|string|} mechanism), etc. etc. Doing all this will require significant work, and that means interested people have to step forward.

  3. In terms of efficiency of the generated code, in principle. OCaml has no features that would prevent generating code as efficient as good C code for lexical analysis. In practice, of course, our compiler isn’t good enough. This might be an interesting opportunity for someone who would like to show what the OCaml compiler is capable of given some tuning.

I’ve done a small update to README.md to fix these two things. Please let me know if there are other things you would like fixed (or open an issue or a PR.)

At the moment, OCaml source is always assumed to be in Latin-1, per the manual. People often write UTF-8 anyway, but technically this doesn’t work. Also it turns out that the encoding situation isn’t entirely neutral, as some transformers (like jsoo and bucklescript) need to convert strings to JavaScript strings, which use utf16, so the question really needs to be settled at some point.

See also: Make the character set for OCaml source code officially UTF-8. by pmetzger · Pull Request #1802 · ocaml/ocaml · GitHub which I really should get back to working on at some point soon.

1 Like

By this, do you just mean submatch capture with as?

First off, I’m pleased to see sedlex getting some love finally. Very grateful to the community for that.

I would like to add here that my forthcoming Orsetto project includes another alternative to sedlex that might be worth noting, although it has issues and it remains in the “unstable” branch while I’m slowly working on it in my copious spare time.

I would describe it here as follows:

Library Syntax Composition Refill Unicode Automaton Regexs
Orsetto.UCS OCaml Yes Yes Yes Lazy DFA Basic¹

¹: A subset of UTS #18, RL1 (no loose matching, word or line boundaries, etc.)

Also, I’m not sure what “refill” means here, so I didn’t characterize it.

As for the question mentioned above about how to keep the lazy DFA in a Unicode regular expression engine from consuming all the memory in the world, I should say a word here about the approach I took. I used discrete interval sets and maps for both the Unicode property database and the DFA state nodes. These are implemented as static inverse-multiplicative binary search trees in the hopes of avoiding death by cache-miss overload. I took @dbuenzli’s excellent Uucf module and invented my own UTF-8 codec and Unicode normalization functions using the same core data structures in the Orsetto CF framework.

1 Like

Any chance of breaking that work out?

Unfortunately putting that inside Orsetto means that it will not be used directly. I would prefer efforts towards improving Re, or providing a replacement to it. This is, by the way, a general comment on your work. You seem to dedicate a lot of energy to it, and it seems great, but so far you have kept it in your own little library while existing commonly used libraries could benefit strongly from these new clean and carefully crafted implementations.

That being said, this looks very interesting and I would be very curious to have a comparison between your lazy DFA implementation and the one in Re! Do you handle capture groups?

As for the refill mechanism, it’s related to streaming. For most re engine, this is handled by having the text in a buffer that can be refilled when the end is reached.

1 Like

I’m seriously considering a release plan for Orsetto that would begin with an OPAM package that contains just the Core Foundation library. It depends on what I’ll be ready to commit to supporting for community adoption. The discrete interval sets and maps I discussed above are one of the general-purpose data structures in that framework. It’s the closest to readiness for public consumption. The Unicode support library is further away, and the CBOR and JSON frameworks are further still.

p.s. The lazy DFA is also part of the CF framework, and so— for the moment, at least— is the regular expression framework for 8-bit extended ASCII texts.

I’m fine with that. I’m just one person working on Orsetto in my spare time left over between my day job and a rich personal life. I can’t yet commit to maintaining any parts of Orsetto for the community because the interfaces are still unstable.

My response to your critique is to agree: efforts at improving Re or providing a replacement for it would be very welcome. If people taking up those efforts want to reuse any parts of Orsetto I’m mentioning here, then I would be supportive. It’s the reason I licensed the code with BSD 2-clause— to make that as easy as possible.

Ah, I see. I’ve updated my comment reflect the presence of that capability.

In my design, the text buffering logic is separated from the DFA engine and the parser combinator. It makes extensive use of a functional stream type (which will be replaced with Stdlib.Seq.t at some point before I release). I suppose better performance might be had by tightly coupling the text buffering logic with the DFA engine and the parser combinators, but I haven’t taken any measurements.