[ANN] Tyre - type-safe regular expressions 1.0

I am happy to announce the release of Tyre 1.0.

This release makes big changes to the typing: there is a whole host of far more practical combinators allowing you to write code like in a parser combinator library :

let url =
  let+ scheme = opt (const Http (str "http") <|> const Https (str "http") <* str "://") 
  and+ host = rep_any
  and+ port = opt (str ":" *> pos_int)
  and+ path = list (str "/" *> rep_any)
  and+ _ = str "/"
  in
  (scheme, host, port, path)

(this is not a correct url parser by any means, its just to give an idea of how you can use this lib)

This does not allow you to use eval, because only one direction of the conversion is given (there is no code to convert from a url tuple). However in the cases where you don’t need eval this is far easier to write. The difference is typed, you can’t get runtime error by calling eval on the wrong regex.

There are also convencience changes such as charset and the matched_string function, that reduce the need to insert Re bits in your regexp.

Here is the full changelog:

  • Introduce charsets: contrary to Re, they have a different type from regex.
  • Type the difference between regexps that can be evaluated reversed and the
    ones that cannot: (evaluable, 'a) Tyre.t and (non_evaluable, 'a) Tyre.t.
  • Introduce alias type pattern for (non_evaluable, 'a) Tyre.t.
  • Introduce val lift : ('a -> string) -> 'a pattern -> ('e, 'a) t to transform
    a pattern into an expression by giving an explicit conversion function.
    Also liftpp that does the same with better performance by using Format.
  • Introduce val unlift : (evaluable, 'a) t -> 'a pattern.
  • Introduce val either: ('e, 'a) Tyre.t -> ('e, 'b) Tyre.t -> ('e, ('a, 'b) Either.t) Tyre.t.
  • Change the type of alt to (_, 'a) t -> (_, 'a) t -> 'a pattern. Previous
    users of alt should switch to either.
  • Introduce val alt_eval: ('a -> [`Left | `Right]) -> ('e, 'a) t -> ('e, 'a) t -> ('e, 'a) t
    This has flat typing but is compatible with eval.
  • Operators: <|> is alt, <||> is either.
  • Introduce val map : ('a -> 'b) -> (_, 'a) t -> 'b pattern and its
    corresponding operators: let+ and <$>.
  • Introduce (and+) which is an alias of seq.
  • Introduce val app: ('e, 'a -> 'b) t -> ('e, 'a) t -> 'b pattern and its
    corresponding operator <*>
  • Introduce val matched_string : (_, 'a) t -> (_, string) t that discards the
    computed value and just return the string that was matched.
  • Drop dependency on Result library. Stdlib is now used.
  • Introduce val rep_charset: Charset.t -> (_, string) t, and shortcut
    val rep_any: (_, string) t.
5 Likes

One difference between parser combinators and regular expressions is runtime cost: an automaton derived from a regular expression traverses the input once whereas parser combinators typically backtrack on alternatives but they don’t incur an upfront compilation cost. In the worst case, construction of the automaton can be expensive and regex libraries use different approaches to mitigate this. Could you say something about the costs to compile the regular expression to an automaton and the cost of executing this automaton?

This uses Re in the backend, and the runtime cost should be exactly the same + the cost of your conversion functions. Converting the tyre regexp to a Re one, is one-pass and very direct code. There is just some bureaucracy to compute the indexes of the capture groups.

The only case where you can run issues is when using list ( list ...)), because Re does not support groups in rep (well, it does, but only returns the last match). So what Tyre does is “backtrack” to extract the values. The more nested list you have the worse this gets.

I am thinking about extending Re to provide some sort of tree of groups, currently its a list, which is why you can’t do groups in rep cleanly. But I have never contributed to Re, so I don’t know how hard this would be.

Re does compile to an automaton, and does not backtrack (unlike some regexp implementation like PCRE). I ran some benchmark a while ago, and its a little slower than Re2 (google’s C++ regexp library). But a whole lot faster than PCRE (a historical C library that does backtracking).

There are limits in term of what you can parse compared to a parser combinator library. Because there is no fixed point, you can’t parse recursive grammars, only regular language, as expected for a regexp library.

3 Likes