[ANN] ppx_regexp 0.2.0 and 0.3.0

This is the first release of ppx_regexp, a ppx extension which turns OCaml patterns into invocations to the Re library. It supports named bindings of either string or string option type depending on whether the capture is guaranteed or not. The ppx also serves to lift the compilation of the regular out of abstractions and functors and up to the file-level module.

This extension should be useful as a data-mining tool, due to it’s terse syntax and the efficiency of the Re library on which it is based. I have previously used mikmatch with great success, and I wanted a ppx-based replacement. Fortunately the needed ingredients were now already available in the Re library (with Re.Mark) and the ppx framework.

Note that there are also some limitations:

  • Pattern guards are not supported. Since the full match or function is compiled into a single regular expression, at the point where the Re.exec function finds a match and terminates, we can no longer backtrack in case the guard condition fails.
  • No exhaustiveness checks are done. Instead the rewriter will issue a warning if no match-all case is provided. Exhaustiveness checks for regular expressions seems like an interesting problem, but not one that I will look into myself at this point.

There is also a combinator-based interface to Re, tyre. The transformation done by ppx_regexp is very similar to what tyre does, so I’d expect about the same performance.

3 Likes

I’m quite happy to see a ppx version of the Re.Mark-based routing.

Re contains various functions to produce the complete automatons. If we were to add Complement/intersection (which is in theory feasible for derivative-based regular expression libraries, although groups make things messy) we could easily compute “missing matches”. Generalizing Re.compl and Re.inter is quite desirable, but nobody figured out how to really do it in Re, unfortunately.

At the moment, you can’t use things from type Re.t in your ppx and you can’t use regex defined with your ppx inside combinators. This restricts composition a lot. With other syntax based regex (Re_posix and others), you can at least do the later. Do you think you could expose your syntax extension to produce and use combinators? Maybe by (ab)using tyre’s converters ?

Do you plan to handle groups under repetitions? That was particularly painful to do in Tyre. Alain frisch has a paper explaining how to do it in a regex engine directly, but adding that to Re is probably quite non trivial…

That first part sounds encouraging. If we focus on exhaustiveness checks / counter-example some simplifications should be possible. At least we could ignore the groups since back-references are not supported. And a less efficient algorithm than Re’s automaton (derivation-based?) would do if it helps. On the other hand, I’m not even sure how important it is; at least where I used mikmatch, the patterns were always needles (log messages), so I needed the catch-all cases anyway.

Could the choice of Re_posix instead of Re_pcre have made the regex defined with ppx_regexp reusable inside combinators? I don’t think I understood that point.

Revising my opinion here: I think a string-based patterns would work well with tyre. There are a few elementary types supported and the higher order patterns would be the usual *, ?, etc. One issue though would be whether to still allow pattern variables nested under repeats. I think x in {|<(((?<x:int>\d+),)*;)*>|} should have type int list list, but the question could be avoided by only allowing top-level named capture.

I didn’t have much ambition there; I checked what the Re library did and relayed the behaviour. That is, returning the last match if any. I think the algorithm which gives an option to the string type in case of any surrounding * or ? or {0,n} could be generalised to turn nested *, {i,j} and ? into nested list, list, and option types, but I think that should first be supported by the Re library. (Thanks for the link, I just had a glance, but will look at it more closely.)

No. I simply meant that you could write things like Re.(alt [Re_posix.regex "[abcd]{2,4}" ; opt alpha]), so you have composition in one direction. It’s true for all Re_* module.

As far as I can see, tyre is more expressive, so it should be fairly easy to simply desugar down to combinators (just like the Re_* modules, but with more types!). The example in your readme:

function%pcre
   | {|(?<t>.*:\d\d) .* postfix/smtpd\[[0-9]+\]: connect from (?<host>[a-z0-9.-]+)|} ->
      Lwt_io.printlf "%s %s" t host
   | _ ->
      Lwt.return_unit)

is equivalent to

let p s = Tyre.regex (Re_perl.re ?opts s)
let re = Tyre.(route [
  p".*:\d\d" <* p" .* postfix/smtpd\[[0-9]+\]: connect from " *> p"[a-z0-9.-]+" -->
    (fun (t, host) -> Lwt_io.printlf "%s %s" t host) ;
  p".*" --> 
    fun _ -> Lwt.return_unit ;
])

Which is barely longer, but I understand the syntax aspect :stuck_out_tongue:.
That doesn’t completely solve the composibility problem, though, since Tyre.route doesn’t produce composable objects either. The nesting works too (Tyre.(list (list (int <* str",") <* str";")) but it’s inefficient (it re-matches, hence the last point in my previous reply), so it’s debatable.

I find ppx_regexp very interesting and simplicity is a good part of why I think it’s cool. This brings OCaml closer to perl for writing text manipulating scripts (or bigger programs). Tyre has the cool factor of full-steam GADT, but clearly your snippet loses all the simplicity in the translation.

1 Like

I see your point, and I think on both accounts what is lacking is for ppx_regexp do do it’s own RE parsing, since that would both allow substituting predefined Re.ts into patterns and to construct tyre combinators, but restricting named capture to the top-level.

Thanks, may aim was exactly an opportunistic combination of existing technologies into something good enough for practical use, though Tyre and Angstrom is way cooler on the semantic level. That’s not to say we could not also have a match%tyre.

@c-cube I completely agree. My remark was about getting the best of both worlds: Familiar/terse pcre syntax with ppx_regexp, which desugars to tyre/re combinators to get good composition. Just like Re_* modules.

@paurkedal Yes, exactly. You could probably reuse re’s parser.

Can you install this with opam ?

Yes, did you encounter any problems? It should work with 4.02.3, 4.03.0 and 4.04.1.

My bad, I had not updated opam and ppx_regexp wasn’t listed! It installed fine, thanks.

1 Like

I found a bug in 0.2.0 and 0.3.0, which is addressed by the 0.3.1 which is now available in opam. The changes are:

  • Fix accidental shadowing of open from another interface-less module using
    ppx_regexp.
  • Support binding of group 0 and the universal pattern.
  • Switch to ppx_tools_versioned. This provides support for 4.02.3 in the
    main branch.