[ANN] Cure2 - combinator frontend for re2

Cure2 is a little library I just made that provides a combinator frontend to Re2.

With it, instead of "https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&//=]*)",
you can write:

let second_level_char = charset Charset.[Ascii.alnum; chars "-@:%._\\+~#="] in
let top_level_chars = charset Charset.[Ascii.alnum; chars "()"] in
let path_chars = charset Charset.[Ascii.alnum; chars "()@:%_\\+.~#?&/="] in
str "http" + !?(char 's') + str "://"
+ !?(str "www.") + rep ~min:1 ~max:256 second_level_char
+ char '.' + rep ~min:1 ~max:6 top_level_chars
+ bow + rep path_chars

This is great, but it feels a bit silly to construct an abstract representation of a regexp only to convert it to a string, which is presumably then parsed back into something isomorphic to the initial representation. You’d think one could use the underlying library directly, haha

Maybe it’s possible if you add C++ bindings, but (OCaml) re2 does not expose the (C++) re2 AST.

It’s also probably quite hard to do, given that this took me 3 days to complete. Also keep in mind that the performance is the same, because the conversion-to-string + parsing step should only happen once anyway.

To be clear, that wasn’t a criticism of you at all; it makes perfect sense to do this. I just find it a bit of a shame on an aesthetic level. Maybe it is a criticism of the entire programming world, which has settled on strings as the go-to regexp API.

I agree that it’s hard to imagine that the performance of constructing regexps should matter in practice. Also, if the regexp is constructed dynamically, it might come down to the same thing, e.g. I assume Printf.sprintf "/users/(%s|%s)/blah" foo bar and str "/users/" + (str foo || str bar) + str "/blah" perform essentially the same operations.

If you allow me to think out loud, I have a sort of vague memory and mental model for a frontend library for regexps like this one:

  • (a) There’s some heavy lifting in the low-level implementation of the runtime (here it’s a C or C++ backend?)
  • (b) You can do some real nice work to create nice combinators on top - that’s less work than (a) but still requires some time and careful thought and craft. Cure2 looks great!
  • (c) And then there’s a final touch up where you create a typed applicative on top of (b). I think this one takes less time than it takes to write (b).

It’s sort of like this:

If 'a t is a type that holds a regexp and enough data, providing some runtime environment, to allow extracting a value of type 'a from an input using the regexp (or failing), then the second part is some function regexp -> input -> .. -> ('a, err) result.

Once you have this, it composes with an append operation: if you have two guys like this ('a t and 'b t), you can build the concatenation of the two regexps, and extract the two values from an input and the combined regexp to build something more complex with the values a & b. So you can build a ('a * 'b) t. You package that into a nice let-syntax interface, and now you can write things like you would with any parser with let-syntax (command-line parsing, Angstrom, etc.).
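To make that concrete, here is a toy model of that composition, with naive string parsers standing in for a real regexp engine; all the names below (run, both, segment, user_and_page) are made up for illustration:

```ocaml
(* Toy model of the typed layer: a value of type ['a t] both matches
   and extracts an ['a].  Here it is a naive string parser rather than
   a compiled regexp. *)
type 'a t = string -> ('a * string) option

(* [run p s] succeeds iff [p] consumes all of [s]. *)
let run (p : 'a t) (s : string) : 'a option =
  match p s with Some (v, "") -> Some v | _ -> None

let str (lit : string) : unit t = fun s ->
  let n = String.length lit in
  if String.length s >= n && String.sub s 0 n = lit
  then Some ((), String.sub s n (String.length s - n))
  else None

let map (p : 'a t) ~(f : 'a -> 'b) : 'b t = fun s ->
  Option.map (fun (v, rest) -> (f v, rest)) (p s)

(* The append operation: concatenate two matchers and pair their values. *)
let both (pa : 'a t) (pb : 'b t) : ('a * 'b) t = fun s ->
  match pa s with
  | None -> None
  | Some (a, rest) ->
    (match pb rest with
     | None -> None
     | Some (b, rest') -> Some ((a, b), rest'))

let ( let+ ) p f = map p ~f
let ( and+ ) = both

(* Capture a maximal non-empty run of non-'/' characters. *)
let segment : string t = fun s ->
  let i = ref 0 in
  while !i < String.length s && s.[!i] <> '/' do incr i done;
  if !i = 0 then None
  else Some (String.sub s 0 !i, String.sub s !i (String.length s - !i))

(* The let-syntax style from above, for /users/<user>/<page>: *)
let user_and_page : (string * string) t =
  let+ () = str "/users/"
  and+ user = segment
  and+ () = str "/"
  and+ page = segment in
  (user, page)
```

With this, run user_and_page "/users/alice/blah" yields Some ("alice", "blah"), and a non-matching input yields None.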

I think there is one instance of what I’m talking about for re2 here but I can’t find any examples!

To me it sounds like Cure2 is very close to having that extra typed let-syntax layer, and that would be super ergonomic! Note I haven’t tried, it’s just what I was reminded of while reading the conversation, especially the part about aesthetic.

You’re addressing a very, very real pain-point. I am a -rabid- fan of using regexps, and have sometimes written -massive- REs. They end up with duplication, and their size alone makes them daunting to read and understand a few months after they’re written.

Notwithstanding, they’re still (to me) far, far better than the equivalent loop-nests. But what would be great would be a “little language” where you could write your regexp in stages, e.g.

slc = [-@:%._\\+~#=]
tlc = [()]
pc = [()@:%_\\+.~#?&/=]
"http(s)?://(www.)?" slc{1,256} "." tlc{1,6} bow pc*

where I’m clearly wildly abusing the notation and syntax, and what I’ve written isn’t from any known RE dialect. But what I mean is,

(a) named sub-expressions

(b) ability to combine named subexpressions in follow-on subexps and the final expression

(c) a bit of an expression-language for combining the sub-expressions.

Maybe for that last line, instead

http(s)?://(www.)?{slc}{1,256}.{tlc}{1,6}{bow}{pc}*

where now “{}” become special chars and enclose subexpression names. But in the first version, whitespace was not part of the regexp, so the regexps probably need to be quoted if they contain whitespace, so that it can be distinguished from whitespace-for-formatting-and-readability.
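For what it’s worth, the {name} splicing itself is easy to prototype over plain strings. Below is a toy expander; the definition table and names are made up, and counted repetition like {1,256} is left untouched because it doesn’t name a definition:

```ocaml
(* Toy expander for a hypothetical {name} splicing syntax: named
   sub-expressions live in a table and get spliced into the final
   pattern string.  Purely illustrative. *)
let defs =
  [ "slc", "[-@:%._+~#=a-zA-Z0-9]"
  ; "tlc", "[()a-zA-Z0-9]"
  ; "pc", "[()@:%_+.~#?&/=a-zA-Z0-9]"
  ]

(* Replace every {name} in [pat] with its definition; unknown names
   (e.g. {1,256}) are copied through verbatim. *)
let expand (pat : string) : string =
  let buf = Buffer.create (String.length pat) in
  let i = ref 0 in
  let n = String.length pat in
  while !i < n do
    if pat.[!i] = '{' then begin
      match String.index_from_opt pat !i '}' with
      | Some j ->
        let name = String.sub pat (!i + 1) (j - !i - 1) in
        (match List.assoc_opt name defs with
         | Some re -> Buffer.add_string buf re
         | None -> Buffer.add_string buf (String.sub pat !i (j - !i + 1)));
        i := j + 1
      | None ->
        Buffer.add_char buf pat.[!i];
        incr i
    end
    else begin
      Buffer.add_char buf pat.[!i];
      incr i
    end
  done;
  Buffer.contents buf
```

For example, expand "{slc}{1,256}" splices the slc definition and keeps the {1,256} counter as-is.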

OK, I’m just babbling here, but really what I want to say is: this is a real problem that hinders usability for regexps, and I’m excited to see that you’re working on addressing it.

I think quoting is very important yes. The downside of my syntax is that even when quoting is enough information, you still have to disambiguate:

"some prefix" (rep any) is enough if you know you are in a regexp context, but you have to write str "some prefix" + (rep any) if you want the syntax to be embedded in ocaml.

I think the ocamllex syntax for regexps is very good: it has the nice and short operators you would expect, like *, ? and +; and strings have to be quoted, so there is no ambiguity between text and operator. This also unlocks a very easy syntax for subexpressions: foo matches the regexp defined as foo, and "foo" matches the text foo.

However, of course, you can’t just use the ocamllex regexp syntax outside of ocamllex.
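For reference, here is a sketch of what that named-subexpression style looks like inside an ocamllex spec (.mll file); the definitions and rule below are made up for illustration:

```ocamllex
(* Named sub-expressions, then a rule that combines them. *)
let alnum = ['a'-'z' 'A'-'Z' '0'-'9']
let scheme = "http" 's'? "://"
let host_char = alnum | ['-' '.']

rule url = parse
  | scheme (host_char+ as host) { host }   (* [host] is the matched text *)
  | _ { url lexbuf }                       (* skip and keep scanning *)
  | eof { failwith "no url found" }
```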


Tyre is such a typed syntax for re; it is an OCaml-only regexp combinator library. The issue with it is that I don’t think it has any support for Unicode, whereas re2 does. Also, I find the typed aspect only useful if you need to capture things; otherwise it’s just distracting.

I didn’t know Tyre, thanks! From what I see it supports bi-directional encoding, so I’d think of it a bit more like a codec (e.g. json_encoding) rather than an applicative like I was talking about.

Ah, interesting! OK. True, I was thinking primarily about capture. For me it’d be more that I’d be reluctant to switch style/approach halfway through if I realize I need to add some extraction capability. Besides, can’t you just use unit t for a long regexp with no capture?

I took some time this morning to convert your example to that style, to have some example and refresh my mind. It’s probably broken in some ways, this is just an example!

module M = struct
  type t =
    { protocol : [ `Http | `Https ]
    ; address : string
    ; path : string
    }
  [@@deriving sexp_of]
end

let chars str =
  let li = ref [] in
  String.iter str ~f:(fun c -> li := c :: !li);
  List.rev !li |> Re2.Parser.Char.one_of
;;

let second_level_char = Re2.Parser.(or_ [ Char.alnum; chars "-@:%._\\+~#=" ] |> ignore_m)
let top_level_char = Re2.Parser.(or_ [ Char.alnum; chars "()" ] |> ignore_m)
let path_char = Re2.Parser.(or_ [ Char.alnum; chars "()@:%_\\+.~#?&/=" ] |> ignore_m)
let bow = Re2.Parser.(of_re2 (Re2.create_exn "\\b") |> ignore_m)

let p : M.t Re2.Parser.t =
  let ( let+ ) a f = Re2.Parser.map a ~f
  and ( and+ ) = Re2.Parser.both in
  let open Re2.Parser in
  let+ () = start_of_input
  and+ protocol =
    let+ () = string "http"
    and+ s = optional (string "s") in
    if Option.is_some s then `Https else `Http
  and+ () = string "://"
  and+ address =
    (let+ () = optional (string "www.") |> ignore_m
     and+ () = repeat ~min:1 ~max:(Some 256) second_level_char
     and+ () = string "."
     and+ () = repeat ~min:1 ~max:(Some 6) top_level_char in
     ())
    |> capture
  and+ () = bow
  and+ path = repeat path_char |> capture
  and+ () = end_of_input in
  { M.protocol; address; path }
;;

let%expect_test "show-p" =
  print_endline (Re2.Parser.to_regex_string p);
  [%expect
    {| ^http(s)?\:\/\/((www\.)?(?:(?:()([[:alnum:]])|()([\-\@\:\%\._\\\+\~\#\=]))){1,256}\.(?:(?:()([[:alnum:]])|()([\(\)]))){1,6})(?:\b)((?:(?:()([[:alnum:]])|()([\(\)\@\:\%_\\\+\.\~\#\?\&\/\=])))*)$ |}]
;;

let%expect_test "p" =
  (* Test cases courtesy of [ahrefs/cure2:test/test_cure2.ml]. *)
  let extract = Re2.Parser.compile p |> Staged.unstage in
  let test str = print_s [%sexp (extract str : M.t option)] in
  test "https://ahrefs.com/";
  [%expect
    {|
    ((
      (protocol Https)
      (address  ahrefs.com)
      (path     /)))
    |}];
  test "httpss://ahrefs.com/";
  [%expect {| () |}];
  test "https://ocaml.org/p/cure2";
  [%expect
    {|
    ((
      (protocol Https)
      (address  ocaml.org)
      (path     /p/cure2)))
    |}];
  test "https://ocaml.org/p/cu  re2";
  [%expect {| () |}];
  ()
;;

When regular expressions are not constructed dynamically, they could be specified using OCamlLex, which has the ability to build them from named sub expressions.

When you are just building a unit t, a lot of the operators have a meaning with regard to what you are capturing, so this gets noisy.

Btw, I just merged a PR on Tyre to add support for uni-directional use, including new simpler operators like let+. You can check out the new interface here: tyre/src/tyre.mli at master · Drup/tyre · GitHub

Thinking about this in the afternoon, I recalled some code using regexps I worked on a few months ago. Looking back, in fact I didn’t go the typed route there; I simply resolved the group indexes from their field names right after compiling the regex (once), to allow doing the lookup based on integers at runtime (see this pr). It should also be possible to hide the int and group lookup behind an abstract interface that makes it less error-prone. I don’t want to sound too categorical about the typed approach. You’re right that it probably depends a lot on what you’re building!
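As a minimal sketch of that pattern (all names below are hypothetical): resolve each field name to its group index once, right after compiling, so per-match lookups are plain integer accesses:

```ocaml
(* [group_names] stands in for whatever the regex engine reports after
   compiling: slot 0 is the whole match, then one slot per capture
   group.  Everything here is illustrative, not a real engine API. *)
let group_names = [| None; Some "protocol"; Some "host"; Some "path" |]

(* Done once, right after compiling the regex. *)
let index_of_name (name : string) : int =
  let found = ref (-1) in
  Array.iteri (fun i n -> if n = Some name then found := i) group_names;
  if !found < 0 then invalid_arg name else !found

let host_group = index_of_name "host"

(* At match time, [submatches] would come from the engine for each
   match; the lookup is now just an integer index, with no per-match
   string comparison. *)
let host_of (submatches : string option array) : string option =
  submatches.(host_group)
```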

Very nice work!