Recently I learned (from @Stephane_Glondu ) that pcre (in Debian) is now obsolete, superseded by pcre2. Also, recently @tobil4sk ported the pcre-ocaml code to work with pcre2. The original maintainer of pcre is busy with other things, so neither he nor @tobil4sk is interested in maintaining pcre2-ocaml.
I volunteered to do it, but I figured I should first ask if anybody else wants to do it, just b/c … well, b/c it seems a little presumptuous to just jump in on something like this. I personally want to see pcre2-ocaml exist and be maintained, b/c I use pcre often enough when re isn’t enough (e.g. doesn’t support the regexps I use).
Random question (sorry): would it even make sense to merge the efforts of pcre2-ocaml into re? I get that re is currently a pure-OCaml regex engine that can be used with several concrete syntaxes, including a subset of PCRE, whereas pcre2-ocaml is a binding to an existing C library. (re’s readme argues that this is more efficient than PCRE in its full glory, but perhaps that comparison uses the original PCRE engine, and PCRE2 has much improved that aspect?) Still, having both packaged together might make sense. Then re would fully support the PCRE2 syntax and could switch engines as needed. And users would not have to pick a library.
I generally agree with @glen. Even though re is not on par with pcre, I have the feeling that pcre’s extra features are not always needed, and I would prefer switching to re whenever they are not.
The latest 1.11.0 release of re improves compatibility with pcre. In particular, I’ve added support for named groups and some control characters (but not \Cx nor \ddd, which are subsumed by \xdd). However, one notable feature is still missing in re: back references in regexps. They are not trivial to add (and I’m not comfortable implementing them myself).
So pcre2-ocaml could still be useful if these back references are actually needed… until re supports them. Maybe @vouillon could tell us more about this?
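For anyone unsure what a back reference buys you, here is a minimal sketch in plain OCaml (no regex library involved) of what a pattern like `^(.+) \1$` has to check. The helper name `is_doubled` is made up for illustration, and it ignores the detail that `.` normally does not match a newline. The point is only that the second half must equal whatever text the group captured, which is why this does not fit naturally into an automaton-based engine.

```ocaml
(* Hand-written stand-in for the PCRE pattern ^(.+) \1$ .
   [is_doubled] is a hypothetical helper, not part of re or pcre. *)
let is_doubled (s : string) : bool =
  let n = String.length s in
  (* Two copies of the captured text plus one separating space
     always give an odd total length of at least 3. *)
  n >= 3
  && n mod 2 = 1
  &&
  let half = n / 2 in
  (* The middle byte must be the space, and the text after it must be
     identical to the text before it: this equality is the "\1" part. *)
  s.[half] = ' ' && String.sub s 0 half = String.sub s (half + 1) half
```

With this, `is_doubled "abc abc"` holds, while `is_doubled "abc abd"` and `is_doubled "abcabc"` do not.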
Does re support utf8 strings? So that, for instance, a multibyte character can appear in a character class and be treated as one character rather than several? pcre has a flag to behave like this, but I don’t see anything analogous in re.
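To make the concern concrete, here is a small sketch using only the OCaml standard library (the decoding part assumes OCaml 4.14 or later for `String.get_utf_8_uchar`). It does not use re or pcre at all. The helper `in_byte_class` is hypothetical and just models a character class that is a set of bytes, which is what a byte-oriented engine would effectively see.

```ocaml
(* UTF-8 encodings written as explicit bytes, so the example does not
   depend on the encoding of the source file. *)
let e_acute = "\xc3\xa9" (* U+00E9, "é" *)
let a_tilde = "\xc3\xa3" (* U+00E3, "ã" *)

(* Hypothetical model of a byte-oriented character class: the class is
   just the set of bytes appearing in [cls]. *)
let in_byte_class (cls : string) (c : char) : bool = String.contains cls c

(* Decode the code point starting at byte 0 (needs OCaml >= 4.14). *)
let first_code_point (s : string) : int =
  Uchar.to_int (Uchar.utf_decode_uchar (String.get_utf_8_uchar s 0))

(* One character for the reader, but two bytes for the engine. *)
let e_acute_bytes = String.length e_acute

(* The byte class built from "é" accepts the lead byte of "ã",
   because both encodings start with 0xC3. *)
let lead_byte_confused = in_byte_class e_acute a_tilde.[0]

(* Comparing decoded code points keeps the two characters apart. *)
let code_points_differ = first_code_point e_acute <> first_code_point a_tilde
```

Here `e_acute_bytes` is 2, and both `lead_byte_confused` and `code_points_differ` are true, which is the kind of mismatch a UTF-aware matching mode is meant to avoid.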
I’ve queued a PR to release pcre2-ocaml. I didn’t do much review of the code – just released it so I can get going with other packages that depend on it for testing.
If you’re trying to avoid the pcre wrapper by any means necessary (maybe because of the bug you pointed out in another thread), I believe you should be able to use ulex with some massaging. ulex obviously knows about unicode, but (of course) its regexps lack the power of pcre.
Thanks for the suggestion! At the moment I’m successfully working around the bug, and the pcre syntax is more familiar to me (and more powerful), so I’m sticking with pcre.