How to convert Regexp to NFA in OCaml?

joelj · November 4, 2018, 8:30pm

I’m learning OCaml and want to implement a simple regexp engine with support for basic operations such as union, concat and kleene star.

I found this nice exercise: https://www.cl.cam.ac.uk/teaching/1718/L28/exercise3.pdf
But I cannot find the solutions anywhere.

If anyone knows of any similar small regexp -> NFA implementation with sources available, that would be highly appreciated.

Thanks.

ins · November 4, 2018, 8:39pm

This slide deck is pretty good: https://www.cs.umd.edu/class/fall2018/cmsc330/lectures/02-automata.pdf

lindig · November 4, 2018, 9:25pm

Note that the exercise is using Meta OCaml – which is different from OCaml. You are probably aware of the Re library which is a complete regular expression library but it goes beyond the basics (I have not looked at it in detail). I would be interested in a discussion how to implement the non-deterministic automaton (not its construction from a regular expression) and its execution in OCaml as I can think of several approaches.

perry · November 5, 2018, 12:40am

I think you’ll find that there’s a reluctance to post actual solutions to academic homework problems online, for obvious reasons.

That said, if this isn’t homework but you’re truly interested in the topic, I recommend reading the appropriate section of the Dragon Book and also this series of articles by Russ Cox. Neither focuses on the use of OCaml for this purpose, but that’s not important to understanding the algorithms involved. From those two references, it should be pretty straightforward to turn an AST of a regexp into an NFA.

As for how you parse regular expressions into an AST suitable for NFA conversion, I’d suggest that a simple recursive descent parser works well for fairly simple regex formats; you can treat individual characters in the input as tokens quite successfully. I wrote such a parser in OCaml as an exercise a while back and it was pretty straightforward. The recursive descent technique is explained pretty broadly online.

If this isn’t a homework problem, the above should be able to get you started.

joelj · November 5, 2018, 7:45am

This isn’t homework, I’m genuinely interested in compilers

I had read the brilliant articles by Russ Cox already, that’s actually how this started! I understood why the regexp engines in modern languages are so complex, since it’s due to the need to support complex features such as back-references.

The “Implementation” section of the article uses C for all code. Since it’s using lots of for loops and if statements,
it’s therefore difficult for me as an OCaml beginner to translate it into idiomatic OCaml code.

I think many of the points made by Russ Cox back in 2007 are still valid 2018. Not all regexp expressions makes use of all the complex features, so maybe there is a market for such a library like RE2, but implemented in pure OCaml.

I think a series of articles based on Russ Cox’s original one, but where OCaml instead of C is used as the language for the reference implementation, would be very interesting to a lot of people, who are looking at OCaml as a candidate language for their parser/compiler projects.

perry · November 5, 2018, 1:51pm

Ignore the code and instead focus on what the algorithms are. Most of it is straightforward graph processing.

It might be interesting, but someone would need to take the time to write it, it’s a non-trivial undertaking.

joelj · November 5, 2018, 2:53pm

If it would be possible to somehow pay someone to spend the necessary time on it, I would happily do so. The work could with benefit be made open source under MIT license. Please let me know if someone in this forum would be interested.

Drup · November 5, 2018, 3:30pm

Russ Cox’s set of articles are very good if you intend to make a fast, practical, regex engine, but they are overkill if you just want an introduction.

Turning a regex into an NFA is often done using Thomson’s construction.
Functional programmers however often prefer to use derivatives instead due to their elegance and ease of implementation. It’s also a fairly fun exercise to do it in OCaml.

Note that all of this is only loosely related to actually writing a compiler. If you really want to study compilers, use ocamllex/sedlex and menhir to write your grammar, and start working on a real language (and get yourself the TAPL. The dragon book is from the 80s, and they apparently hadn’t discover type systems yet ).

perry · November 5, 2018, 3:35pm

TAPL is an exceptionally good book, and I recommend it to anyone interested in programming languages, but it doesn’t cover the same material as the dragon book at all. They’re almost completely disjoint. There’s not a line in TAPL about topics like parsing or machine code generation.

joelj · November 5, 2018, 3:58pm

I know of lex/yacc/menhir, but I’m a bit sceptic using them, it looks like almost no production compilers are using them? For some reason almost all of them uses a hand-written top-down recursive decent parser, usually in a single huge file called “parser”. Some people say it’s because it’s easier to produce nice error messages this way, and for performance reasons. Still feels odd. It would be much nicer if both humans and compilers could both use a formal grammar as the primary source of information to understand source code written in a language, instead of having to try to understand huge parser files.

Here is a list of some modern compilers and the corresponding parser for each one:

gist.github.com

https://gist.github.com/joelonsql/2c45515d1a9e0e1cca5a475b56ff0fda

parsers.md

Language | Corporation | LoC | Size | Parser
-------- | ----------- | --- | ---- | ------
WebAssembly |  | 753 lines | 23.8 KB | https://github.com/WebAssembly/spec/blob/master/interpreter/text/parser.mly
Go | Google | 2522 lines | 62.3 KB | https://github.com/golang/go/blob/master/src/go/parser/parser.go
PL/pgSQL |  | 4012 lines | 103 KB | https://github.com/postgres/postgres/blob/master/src/pl/plpgsql/src/pl_gram.y
Python |  | 5297 lines | 161 KB | https://github.com/python/cpython/blob/master/Python/ast.c
Rust | Mozilla | 6363 lines | 249 KB | https://github.com/rust-lang/rust/blob/master/src/libsyntax/parse/parser.rs
C++ (Clang) | Apple | 6822 lines  | 247 KB | https://github.com/llvm-mirror/clang/blob/master/lib/Parse/ParseDecl.cpp
D |  | 8594 lines | 268 KB | https://github.com/dlang/dmd/blob/master/src/ddmd/parse.d
C# | Microsoft | 11425 lines | 473 KB | https://github.com/dotnet/roslyn/blob/master/src/Compilers/CSharp/Portable/Parser/LanguageParser.cs

This file has been truncated. show original

joelj · November 5, 2018, 4:01pm

(Out of these, only WebAssembly, PL/pgSQL/SQL made use of something like yacc, the rest had hand-written parsers.)

Drup · November 5, 2018, 4:07pm

You are missing my point:

Syntactic and lexical analysis are the least interesting part of writing a compiler. In my opinion, they barely qualify as something that should be in a compiler course. The actual interesting part of a compiler, is writing program verification and transformation.
Learning about LL/LR/recursive decent will not teach you about program transformations.
People overthink parsing, it’s a solved issue in 90% of the cases, especially with amazing tools such as menhir.

So: get parsing out of the way as fast as you can: write your parser using available tools that make that trivial so that you can play with the really fun stuff, and let production-grade compilers over-engineer the design of their parsers.

joelj · November 5, 2018, 4:12pm

Maybe this attitude towards parsing is the reason why production compilers mostly use hand-written huge single file parsers. Maybe they simply wanted “parser out of the way as fast as you can” when they started to design their language.

I’m not convinced it’s a solved issue.

joelj · November 5, 2018, 4:25pm

I must confess I think parsing and grammars is really fun stuff

perry · November 5, 2018, 4:50pm

I’d say Menhir beats hand-written recursive descent parsers in most cases. Unfortunately, compilers for some languages (say, C++) involve grammars that are such a royal mess that this isn’t a practical solution, but I’ll note that OCaml itself uses ocamllex + menhir. (Until recently it used ocamlyacc.)

I agree that parsing is a small part of the task of writing a compiler, but it’s a big topic in general, and rather important to computer security (see the work done by the LangSec community in recent years.)

dbuenzli · November 5, 2018, 11:24pm

Right that’s the reason why we had top notch syntax error messages from the OCaml compiler for the last twenty years…

Getting parsing right has a big impact on the usability of your language.

perry · November 6, 2018, 12:45pm

I think academics like teaching parsing a lot because it was one of the first places where pure theory managed to produce results that heavily improved the state of the art. The theory of regular and context free grammars, the various algorithms for dealing with parsing them, etc., are relatively pretty, and so they’re like catnip to an instructor.

That said, parsing remains a major problem. As @dbuenzli indicated, it’s pretty recent (with tools like Menhir) that one could get good error messages out of an LR parser generator created parser, and the whole area is still ripe for research.

And that said, I’ll agree with @Drup that even if parsing is going to remain a big area for research for a long time, it’s only a small fraction of the whole compiler problem; you could spend a whole year course just on optimization I think and not really touch on everything of importance.

dc-mak · November 9, 2018, 10:38am

Here’s something I wrote using the Dragon Book and Standard ML yonks ago. A good exercise would be to translate it to OCaml, clean it up and use more library (standard or Base) functions.

github.com

dc-mak/ML/blob/master/Regex.sml

(* Invariant: always one start state and one end state. *)
type nfa = {n : int,                     (* number of states *)
            s : int,                     (* start state *)
            t : (int * char * int) list, (* transitions for non-empty characters *)
            d : (int * int) list,        (* epsilon-transitions *)
            f : int};                    (* end state *)

datatype regex = Union  of regex * regex
               | Concat of regex * regex
               | Star   of regex
               | Char   of char option;

fun disjoint ({n,s,t,d,f}, m) = 
    {n = n, s = s+m, f = f+m, 
     d = map (fn (x,y)   => (x+m, y+m))    d,
     t = map (fn (x,a,y) => (x+m, a, y+m)) t}

(* construct : regex -> nfa *)

    (* one start state that is accepting *)

This file has been truncated. show original

dc-mak · November 9, 2018, 10:40am

I also took that exact course you reference last year (and did that exercise) - maybe wait a bit (a long bit?) if you’re just starting out with OCaml before trying that.

Topic		Replies	Views
Re2ocaml regexp compiler Ecosystem regexp , lexer	12	641	February 18, 2025
How to convert string to type "regexp"? Learning string	4	1290	June 22, 2020
High-performance lexing in OCaml Ecosystem performance , ocamllex	20	3041	April 1, 2021
Good example of handwritten Lexer + Recursive Descent Parser? Learning	6	650	November 29, 2024
Looking for an OCaml parser in ANTLR or JFlex Ecosystem	6	1895	April 16, 2021

How to convert Regexp to NFA in OCaml?

Related topics