Re2ocaml regexp compiler

Regular expression compiler re2c that was originally written in 1993 for C/C++ now supports OCaml.

A short intro from the official website: re2c stands for Regular Expressions to Code. It is a free and open-source lexer generator that supports C, C++, D, Go, Haskell, Java, JavaScript, OCaml, Python, Rust, V, Zig, and can be extended to other languages by implementing a single syntax file. The primary focus of re2c is on generating fast code: it compiles regular expressions to deterministic finite automata and translates them into direct-coded lexers in the target language (such lexers are generally faster and easier to debug than their table-driven analogues). Secondary re2c focus is on flexibility: it does not assume a fixed program template; instead, it allows the user to embed lexers anywhere in the source code and configure them to avoid unnecessary buffering and bounds checks. Internal algorithm used by re2c is based on a special kind of deterministic finite automata: lookahead TDFA. These automata are as fast as ordinary DFA, but they are also capable of performing submatch extraction with minimal overhead.

There is a detailed user guide and and an online playground with many examples.

6 Likes

If you are familiar with this, could you summarise what regular expressions are supported? Does this support for strings only? What about UTF8? Itā€™s currently not difficult to use OCamlLex to generate a regular expression matcher ahead of time (table driven).

If you are familiar with this, could you summarise what regular expressions are supported?

[Disclaimer: Iā€™m the author of the tool. Since Iā€™m newly registered here, I can only have 2 links in a post so Iā€™ll have to make a follow-up post with links.]

re2ocaml does not use POSIX BRE/ERE and instead it has its own RE syntax (see below). IT has named definitions for sub-RE. It has optional capturing groups with leftmost-greedy or POSIX longest-match semantics (all compile down to TDFA) or standalone position markers (tags). Thereā€™s no built-in Unicode categories, but they are provided via a standard library of definitions.

So, something like abc[de](fg)? would be "abc" [de] "fg"? Thereā€™s also limited support for Flex-like syntax.

  • "foo" Case-sensitive string literal.
  • 'foo' Case-insensitive string literal.
  • [a-xyz], [^a-xyz] Character class (possibly negated).
  • . Any character except newline.
  • R \ S Difference of character classes R and S.
  • R* Zero or more occurrences of R.
  • R+ One or more occurrences of R.
  • R? Optional R.
  • R{n} Repetition of R exactly n times.
  • R{n,} Repetition of R at least n times.
  • R{n,m} Repetition of R from n to m times.
  • (R) Just R; parentheses are used to override precedence. If submatch extraction is enabled, (R) is a capturing or a non-capturing group depending on --invert-captures option.
  • (!R) If submatch extraction is enabled, (!R) is a non-capturing or a capturing group depending on --invert-captures option.
  • R S Concatenation: R followed by S.
  • R | S Alternative: R or S.
  • R / S Lookahead: R followed by S, but S is not consumed.
  • name Regular expression defined as name (or literal string "name" in Flex compatibility mode).
  • {name} Regular expression defined as name in Flex compatibility mode.
  • @stag An s-tag: saves the last input position at which @stag matches in a variable named stag.
  • #mtag An m-tag: saves all input positions at which #mtag matches in a variable named mtag.

Character classes and string literals may contain the following escape sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.

Does this support for strings only?

No, it is very flexible when it comes to input. It may be a string, a byte buffer, buffered file input (the file may be read in chunks [1], it may be a channel [2]), an array of integers [3]) and in general anything that can be expressed via a set of primitive operations. Not all operations need to be defined - only those used by the generated code.

What about UTF8?

It is supported [4] along with a few other encodings. re2ocalml compiles UTF-8 decode into the DFA, it does not use any pre-processing.

Itā€™s currently not difficult to use OCamlLex to generate a regular expression matcher ahead of time (table driven).

I canā€™t say from personal experience, but re2c users requested support of OCaml [5]) and when I asked them the same question, hereā€™s what they said:

Do you see any of these features lacking in existing OCaml generators?

  • ocamllex lacks unicode support
    is not reentrant
    does not have a drop-in UI

  • sedlex lacks submatch extraction
    does not have a drop-in UI

Iā€™ve found the flexible UI to re2c to be the killer feature. Being able to define:YYā€¦ its primitives inline with the rest of my code makes it very convenient to use. This is even without considering the performance argument.

My hope is that re2ocaml should be fairly fast, but I donā€™t have any benchmarks yet.

Hopefully that answers your question, but Iā€™m happy to clarify if not.

4 Likes

Iā€™ve used re2c previously for C code, interesting to see it now supports OCaml.

Another tool in the same category would be ragel, which also has OCaml support (but TBH I never used it, because the package is not easily available on opam).

Itā€™d be interesting to compare the efficiency and flexibility of the generated code by both ragel and re2c.

Although from the re2c website: ā€œRagel programs are fast, but not always correct[ā€¦] Actions on different paths conflict with each other (in particular, an action can conflict with itself) and change the program state in unforeseen ways (e.g. overwrite variables set by another action)ā€.
I think it is very important for a regular expression compiler to generate correct output (it is a minimum requirement, even before looking at performance), so Iā€™m glad that re2c has more focus on correctness than ragel.

Do you know how the generated OCaml code compares? In particular is the generated OCaml code by re2c thread/domain-safe?

Itā€™d be interesting to compare the efficiency and flexibility of the generated code by both ragel and re2c.

Itā€™s a good comparison. So far I only have benchmarks for submatch extraction and only for C, but Iā€™m also curious to compare re2ocaml with other tools, so Iā€™m going to work on more general benchmarks now covering different languages.

Although from the re2c website: ā€œRagel programs are fast, but not always correct[ā€¦] Actions on different paths conflict with each other (in particular, an action can conflict with itself) and change the program state in unforeseen ways (e.g. overwrite variables set by another action)ā€.
I think it is very important for a regular expression compiler to generate correct output (it is a minimum requirement, even before looking at performance), so Iā€™m glad that re2c has more focus on correctness than ragel.

Please note that Ragel programs may be incorrect in the case of submatch extraction, not just in any arbitrary case. This is because it uses an DFA, while re2c uses TDFA that was designed with both correctness and efficiency in mind.

For the record, hereā€™s a longer quote: ā€œRagel programs are fast, but not always correct. This is because ragel cannot handle non-determinism in the general case. It is a fundamental limitation which cannot be solved with disambiguation operators, except for simple cases. Ragel has no notion of tags and tag variables, and its submatch extraction is based on actions ā€” arbitrary blocks of code embedded in the regular expression. When ragel generates a DFA, it puts actions either on transitions or in states (depending on the action type). Non-determinism means that there are multiple parallel NFA paths with different actions that reach a given DFA state. Actions on different paths conflict with each other (in particular, an action can conflict with itself) and change the program state in unforeseen ways (e.g. overwrite variables set by another action). Disambiguation operators can remove some of the conflicting actions, but they cannot fork the program state.ā€

Do you know how the generated OCaml code compares? In particular is the generated OCaml code by re2c thread/domain-safe?

It should be thread-safe. The generated automaton itself is a bunch of co-recursive functions that pass lexer state yyrecord to one another. State definition is up to the user (so itā€™s possible to make it non thread-safe on purpose), but typically it should be a struct with a few immutable fields (like yyinput, the input character sequence) and a few mutable fields (like yycursor, the current position in the input that is updated by the functions). The struct is created and initialized by the user, and all resources (such the input buffer, if the input is read from file) are allocated by the user in the way that suits them. Thereā€™s no global mutable state hidden anywhere.

Regarding domain-safe, I believe the above applies to it as well, but Iā€™m not familiar enough with the concept to say for sure.

1 Like

Hereā€™s a very simple benchmark: re2ocaml vs ragel-ocaml on recognizing a decimal integer (not parsing, no file input, no buffers - simply running a very primitive lexer function on the same string many times in a loop). Itā€™s just to get the general idea, not a real benchmark yet.

x.rl (ragel code, one action leaving the state machine for integer):

%%{
  machine atoi;
  write data;
}%%

let atoi data =
  let cs = ref 0 in
  let p = ref 0 in
  let pe = ref (String.length data) in
  let res = ref false in

  %%{
    main := digit+ %{ res := true; } '\n';

    write init;
    write exec;
  }%%

  !res

let () =
  let s = "1234567890\n" in

  (* if not (atoi s) then raise (Failure "error"); *)

  for i = 1 to (10 * 1000 * 1000) do
    ignore (atoi s)
  done

y.re (re2ocaml code):

open String

type state = {
    yyinput: string;
    mutable yycursor: int;
    mutable yymarker: int;
}

%{
    re2c:YYFN = ["lex;bool", "yyrecord;state"];
    re2c:yyfill:enable = 0;

    [0-9]+ [\n] { true }
    *           { false }
%}

let main () =
    let s = "1234567890\n" in

    (* let st = {yyinput = s; yycursor = 0; yymarker = 0}
       in if not (lex st) then raise (Failure "error"); *)

    for i = 1 to (10 * 1000 * 1000) do
      let st = {yyinput = s; yycursor = 0; yymarker = 0}
      in ignore (lex st)
    done

let _ = main ()

Compilation (re2ocaml version 4.0, ragel version 7.0.4 and I used the fastest available mode -F1):

$ ragel-ocaml -F1 x.rl -o x.ml && ocamlopt x.ml -o x
$ re2ocaml -i y.re -o y.ml && ocamlopt y.ml -o y

Results:

$ time ./x

real    0m0.684s
user    0m0.683s
sys     0m0.000s

$ time ./y

real    0m0.172s
user    0m0.171s
sys     0m0.001s

So re2ocaml is quite a bit faster. Itā€™s easy to explain though, ragelā€™s real fast mode -G2 is not implemented for ocaml, so it generates a table-based lexer which has one extra dereference to make on each character (indirect jump in the table). Perf stat confirms this (look at branch counts):

$ perf stat ./x

 Performance counter stats for './x':

            671.07 msec task-clock:u                     #    0.999 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               124      page-faults:u                    #  184.781 /sec                      
     2,523,955,805      cycles:u                         #    3.761 GHz                       
    10,350,306,836      instructions:u                   #    4.10  insn per cycle            
     2,270,066,024      branches:u                       #    3.383 G/sec                     
             9,109      branch-misses:u                  #    0.00% of all branches           

       0.671469889 seconds time elapsed

       0.669477000 seconds user
       0.001995000 seconds sys

$ perf stat ./y

 Performance counter stats for './y':

            172.91 msec task-clock:u                     #    0.996 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               650      page-faults:u                    #    3.759 K/sec                     
       649,827,957      cycles:u                         #    3.758 GHz                       
     2,671,586,889      instructions:u                   #    4.11  insn per cycle            
       580,292,838      branches:u                       #    3.356 G/sec                     
             7,586      branch-misses:u                  #    0.00% of all branches           

       0.173548209 seconds time elapsed

       0.172154000 seconds user
       0.001000000 seconds sys

I also find the code re2ocaml generates easier to read, but it may be my familiarity bias:

Ragel can definitely improve; thereā€™s no reason in principle why it canā€™t implement fast mode for OCaml (it would have to be based on tail recursion as in re2ocaml, as thereā€™s no goto in OCaml). In general ragerā€™s fast mode and re2c are very close in speed and both very fast (for C, which one is faster often depends on the C compiler, e.g. GCC vs Clang may give different results).

This is great.

Here are some small suggestions for improvements:

  • Use Sys.opaque_identity in the benchmark loop to ensure that flambda and other optimizations in OCaml donā€™t discard the lexer completely (they arenā€™t arenā€™t smart enough to figure that out yet though, but for future-proofing)
  • pass yyinput and yycursor directly as arguments to all yy* functions: this avoids one level of indirection through a record, and allows to keep the current position in a register. See yycursor.ml and yyargs.ml: yyargs.ml Ā· GitHub
  • the above changes the type signature of lex, if needed for backwards compatibility this function could still take a yyrecord as input, make the call, and update the position.
hyperfine --warmup 2 _build/default/orig.exe _build/default/yycursor.exe _build/default/yyargs.exe _build/default/yyargs_compat.exe
Benchmark 1: _build/default/orig.exe
  Time (mean Ā± Ļƒ):     114.9 ms Ā±   0.2 ms    [User: 113.6 ms, System: 1.0 ms]
  Range (min ā€¦ max):   114.6 ms ā€¦ 115.7 ms    25 runs
 
Benchmark 2: _build/default/yycursor.exe
  Time (mean Ā± Ļƒ):     113.9 ms Ā±   0.3 ms    [User: 112.2 ms, System: 1.3 ms]
  Range (min ā€¦ max):   113.7 ms ā€¦ 115.4 ms    26 runs
  
Benchmark 3: _build/default/yyargs.exe
  Time (mean Ā± Ļƒ):     104.7 ms Ā±   0.3 ms    [User: 103.9 ms, System: 0.6 ms]
  Range (min ā€¦ max):   103.9 ms ā€¦ 105.3 ms    28 runs
 
Benchmark 4: _build/default/yyargs_compat.exe
  Time (mean Ā± Ļƒ):     106.2 ms Ā±   0.1 ms    [User: 104.9 ms, System: 1.0 ms]
  Range (min ā€¦ max):   105.8 ms ā€¦ 106.4 ms    28 runs
 
Summary
  _build/default/yyargs.exe ran
    1.01 Ā± 0.00 times faster than _build/default/yyargs_compat.exe
    1.09 Ā± 0.00 times faster than _build/default/yycursor.exe
    1.10 Ā± 0.00 times faster than _build/default/orig.exe

Of course eliminating bounds checks by using String.unsafe_get instead of String.get can result in a further speedup, but then you need to trust the lexer to stop:

hyperfine --warmup 2 _build/default/yyargs_unsafe.exe _build/default/yyargs.exe           
Benchmark 1: _build/default/yyargs_unsafe.exe
  Time (mean Ā± Ļƒ):      71.5 ms Ā±   0.2 ms    [User: 70.7 ms, System: 0.7 ms]
  Range (min ā€¦ max):    71.4 ms ā€¦  72.8 ms    41 runs
 
Benchmark 2: _build/default/yyargs.exe
  Time (mean Ā± Ļƒ):     104.9 ms Ā±   0.4 ms    [User: 103.9 ms, System: 0.7 ms]
  Range (min ā€¦ max):   104.3 ms ā€¦ 106.4 ms    28 runs
 
Summary
  _build/default/yyargs_unsafe.exe ran
    1.47 Ā± 0.01 times faster than _build/default/yyargs.exe

2 Likes

Thanks, thatā€™s useful! You can do all of the above with the current re2ocaml, you just need to use generic API instead of the default record API to have a bit more freedom in the way you define primitive operations (see section ā€œProgram interfaceā€ in the user guide - discuss wonā€™t let me post a link), and change YYFN prototype as follows:

%{
    re2c:api = generic;
    re2c:define:YYFN = ["lex;bool", "str;string", "cur;int", "mar;int"];
    re2c:define:YYPEEK = "String.unsafe_get str cur";
    re2c:define:YYSKIP = "let cur = cur + 1 in";
    re2c:define:YYBACKUP = "let mar = cur in";
    re2c:define:YYRESTORE = "let cur = mar in";
    re2c:yyfill:enable = 0;

    [0-9]+ [\n] { true }
    *           { false }
%}

let main () =
    let s = "1234567890\n" in

    if not (lex s 0 0) then raise (Failure "error");

    for i = 1 to (10 * 1000 * 1000) do
      ignore (Sys.opaque_identity (lex s 0 0))
    done

let _ = main ()

The generated code is 2x fast compared to my original version:

let rec yy0 (str : string) (cur : int) (mar : int) : bool =
    let yych = String.unsafe_get str cur in
    let cur = cur + 1 in
    match yych with
        | '0'..'9' -> (yy3 [@tailcall]) str cur mar
        | _ -> (yy1 [@tailcall]) str cur mar

and yy1 (str : string) (cur : int) (mar : int) : bool =
    (yy2 [@tailcall]) str cur mar

and yy2 (str : string) (cur : int) (mar : int) : bool =
    false

and yy3 (str : string) (cur : int) (mar : int) : bool =
    let mar = cur in
    let yych = String.unsafe_get str cur in
    match yych with
        | '\n' ->
            let cur = cur + 1 in
            (yy4 [@tailcall]) str cur mar
        | '0'..'9' ->
            let cur = cur + 1 in
            (yy5 [@tailcall]) str cur mar
        | _ -> (yy2 [@tailcall]) str cur mar

and yy4 (str : string) (cur : int) (mar : int) : bool =
    true

and yy5 (str : string) (cur : int) (mar : int) : bool =
    let yych = String.unsafe_get str cur in
    match yych with
        | '\n' ->
            let cur = cur + 1 in
            (yy4 [@tailcall]) str cur mar
        | '0'..'9' ->
            let cur = cur + 1 in
            (yy5 [@tailcall]) str cur mar
        | _ -> (yy6 [@tailcall]) str cur mar

and yy6 (str : string) (cur : int) (mar : int) : bool =
    let cur = mar in
    (yy2 [@tailcall]) str cur mar

and lex (str : string) (cur : int) (mar : int) : bool =
    (yy0 [@tailcall]) str cur mar



let main () =
    let s = "1234567890\n" in

    if not (lex s 0 0) then raise (Failure "error");

    for i = 1 to (10 * 1000 * 1000) do
      ignore (Sys.opaque_identity (lex s 0 0))
    done

let _ = main ()

With regard to unsafe_get:

Of course eliminating bounds checks by using String.unsafe_get instead of String.get can result in a further speedup, but then you need to trust the lexer to stop.

re2ocaml supports a few different ways to check for the end of input described in the manual section ā€œHandling the end of inputā€ (in our example \n acts like a sentinel). Provided that these guidelines are followed, I recommend using unsafe_get, as doing bounds check on every character would be a waste and duplication of what the lexer already does.

You could also have some function yyget that is defined to String.get in debug mode, and String.unsafe_get in release mode.