Note that the exercise is using MetaOCaml, which is different from plain OCaml. You are probably aware of the Re library, which is a complete regular-expression library, but it goes beyond the basics (I have not looked at it in detail). I would be interested in a discussion of how to implement a non-deterministic automaton (not its construction from a regular expression) and its execution in OCaml, as I can think of several approaches.
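To make the question concrete, here is the most direct of the approaches I have in mind, as a minimal sketch: represent the NFA explicitly as a transition list and execute it by tracking the set of live states. The types and the `ab*` example are my own illustration (not from any library), and it assumes OCaml ≥ 4.13 for `String.fold_left`:

```ocaml
(* A minimal sketch, assuming a hand-built NFA rather than one compiled
   from a regex. States are ints; a transition label of None is an
   epsilon (spontaneous) move. *)
type state = int

type nfa = {
  start : state;
  accept : state list;
  trans : (state * char option * state) list;
}

(* All states reachable from [ss] via epsilon moves alone. *)
let rec eps_closure nfa ss =
  let step =
    List.concat_map
      (fun s ->
        List.filter_map
          (fun (s', l, t) -> if s' = s && l = None then Some t else None)
          nfa.trans)
      ss
  in
  let added = List.filter (fun t -> not (List.mem t ss)) step in
  if added = [] then ss else eps_closure nfa (ss @ added)

(* One character step: follow every transition labelled [c]. *)
let step nfa ss c =
  List.concat_map
    (fun s ->
      List.filter_map
        (fun (s', l, t) -> if s' = s && l = Some c then Some t else None)
        nfa.trans)
    ss

(* Run the NFA over a string, keeping the set of live states. *)
let accepts nfa input =
  let final =
    String.fold_left
      (fun ss c -> eps_closure nfa (step nfa ss c))
      (eps_closure nfa [ nfa.start ])
      input
  in
  List.exists (fun s -> List.mem s nfa.accept) final
```

For example, the NFA for `ab*` can be written as `{ start = 0; accept = [1]; trans = [ (0, Some 'a', 1); (1, Some 'b', 1) ] }`. This is the textbook set-of-states simulation; other approaches (a state-indexed transition array, or converting to a DFA up front) trade memory for speed, which is part of what I'd like to discuss.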
I think you’ll find that there’s a reluctance to post actual solutions to academic homework problems online, for obvious reasons.
That said, if this isn’t homework but you’re truly interested in the topic, I recommend reading the appropriate section of the Dragon Book and also this series of articles by Russ Cox. Neither focuses on the use of OCaml for this purpose, but that’s not important to understanding the algorithms involved. From those two references, it should be pretty straightforward to turn an AST of a regexp into an NFA.
As for how you parse regular expressions into an AST suitable for NFA conversion, I’d suggest that a simple recursive descent parser works well for fairly simple regex formats; you can treat individual characters in the input as tokens quite successfully. I wrote such a parser in OCaml as an exercise a while back and it was pretty straightforward. The recursive descent technique is explained pretty broadly online.
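To give a flavour, here is a hedged sketch of such a parser for a tiny regex dialect: literals, alternation `|`, repetition `*`, and grouping. The AST type and names are my own invention for illustration, not from any particular library:

```ocaml
(* Each grammar rule becomes one function in a block of mutually
   recursive functions -- the essence of recursive descent. Individual
   characters of the input serve directly as tokens. *)
type regex =
  | Empty                    (* matches the empty string *)
  | Char of char
  | Cat of regex * regex
  | Alt of regex * regex
  | Star of regex

exception Parse_error of string

let parse (s : string) : regex =
  let pos = ref 0 in
  let peek () = if !pos < String.length s then Some s.[!pos] else None in
  let advance () = incr pos in
  (* alt := cat ('|' cat)* *)
  let rec alt () =
    let l = cat () in
    match peek () with
    | Some '|' -> advance (); Alt (l, alt ())
    | _ -> l
  (* cat := rep rep* *)
  and cat () =
    match peek () with
    | None | Some ('|' | ')') -> Empty
    | _ ->
        let l = rep () in
        (match cat () with Empty -> l | r -> Cat (l, r))
  (* rep := atom '*'? *)
  and rep () =
    let a = atom () in
    match peek () with
    | Some '*' -> advance (); Star a
    | _ -> a
  (* atom := '(' alt ')' | literal *)
  and atom () =
    match peek () with
    | Some '(' ->
        advance ();
        let r = alt () in
        (match peek () with
         | Some ')' -> advance (); r
         | _ -> raise (Parse_error "expected ')'"))
    | Some (('|' | ')' | '*') as c) ->
        raise (Parse_error (Printf.sprintf "unexpected '%c'" c))
    | Some c -> advance (); Char c
    | None -> raise (Parse_error "unexpected end of input")
  in
  let r = alt () in
  if !pos <> String.length s then raise (Parse_error "trailing input");
  r
```

The resulting AST slots straight into an NFA construction. The nice property of this style in OCaml is that the grammar is visible in the code: one function per precedence level, with variants covering each alternative.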
If this isn’t a homework problem, the above should be able to get you started.
This isn’t homework; I’m genuinely interested in compilers.
I had already read the brilliant articles by Russ Cox; that’s actually how this started! They helped me understand why the regexp engines in modern languages are so complex: it’s due to the need to support features such as back-references.
The “Implementation” section of the article uses C for all the code. Since it relies heavily on for loops and if statements, it’s difficult for me, as an OCaml beginner, to translate it into idiomatic OCaml code.
I think many of the points Russ Cox made back in 2007 are still valid in 2018. Not all regular expressions make use of the complex features, so maybe there is a market for a library like RE2, but implemented in pure OCaml.
I think a series of articles based on Russ Cox’s original ones, but using OCaml instead of C for the reference implementation, would be very interesting to a lot of people who are looking at OCaml as a candidate language for their parser/compiler projects.
If it were possible to somehow pay someone to spend the necessary time on it, I would happily do so. The result could usefully be made open source under the MIT license. Please let me know if anyone in this forum would be interested.
Russ Cox’s set of articles is very good if you intend to build a fast, practical regex engine, but it is overkill if you just want an introduction.
Turning a regex into an NFA is often done using Thompson’s construction.
Functional programmers, however, often prefer to use derivatives instead, due to their elegance and ease of implementation. It’s also a fairly fun exercise to do in OCaml.
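To illustrate, here is a minimal sketch of a matcher based on Brzozowski derivatives. It assumes a small regex AST (the kind Thompson’s construction would also start from), uses `String.fold_left` (OCaml ≥ 4.13), and skips the simplification of derived regexes that a serious implementation would add to keep them small:

```ocaml
type regex =
  | Empty          (* the empty language: matches nothing *)
  | Eps            (* matches only the empty string *)
  | Char of char
  | Cat of regex * regex
  | Alt of regex * regex
  | Star of regex

(* Does [r] accept the empty string? *)
let rec nullable = function
  | Empty | Char _ -> false
  | Eps | Star _ -> true
  | Cat (a, b) -> nullable a && nullable b
  | Alt (a, b) -> nullable a || nullable b

(* The derivative of [r] with respect to [c]: the language of suffixes
   of words in [r] that begin with [c]. *)
let rec deriv c = function
  | Empty | Eps -> Empty
  | Char c' -> if c = c' then Eps else Empty
  | Alt (a, b) -> Alt (deriv c a, deriv c b)
  | Cat (a, b) ->
      let d = Cat (deriv c a, b) in
      if nullable a then Alt (d, deriv c b) else d
  | Star a as r -> Cat (deriv c a, r)

(* Differentiate once per input character, then test nullability. *)
let matches r s =
  nullable (String.fold_left (fun r c -> deriv c r) r s)
```

No explicit automaton is ever built: the residual regex after consuming the input plays the role of the current state, which is what makes the approach so compact compared with a construction-plus-simulation pipeline.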
Note that all of this is only loosely related to actually writing a compiler. If you really want to study compilers, use ocamllex/sedlex and menhir to write your grammar, and start working on a real language (and get yourself TAPL. The Dragon Book is from the 80s, and its authors apparently hadn’t discovered type systems yet ).
TAPL is an exceptionally good book, and I recommend it to anyone interested in programming languages, but it doesn’t cover the same material as the Dragon Book at all. The two are almost completely disjoint. There’s not a line in TAPL about topics like parsing or machine code generation.
I know of lex/yacc/menhir, but I’m a bit sceptical about using them; it looks like almost no production compilers use them? For some reason almost all of them use a hand-written top-down recursive descent parser, usually in a single huge file called “parser”. Some people say this is because it’s easier to produce nice error messages that way, and for performance reasons. It still feels odd. It would be much nicer if humans and compilers could both use a formal grammar as the primary source of information for understanding source code written in a language, instead of having to wade through huge parser files.
Here is a list of some modern compilers and the corresponding parser for each one:
Syntactic and lexical analysis are the least interesting parts of writing a compiler. In my opinion, they barely qualify as something that should be in a compiler course. The actually interesting part of a compiler is program verification and transformation.
Learning about LL/LR/recursive descent will not teach you about program transformations.
People overthink parsing, it’s a solved issue in 90% of the cases, especially with amazing tools such as menhir.
So: get parsing out of the way as fast as you can: write your parser using available tools that make that trivial so that you can play with the really fun stuff, and let production-grade compilers over-engineer the design of their parsers.
Maybe this attitude towards parsing is the reason why production compilers mostly use huge, single-file, hand-written parsers. Maybe their authors simply wanted to get parsing out of the way as fast as they could when they started designing their language.
I’d say Menhir beats hand-written recursive descent parsers in most cases. Unfortunately, compilers for some languages (say, C++) involve grammars that are such a royal mess that this isn’t a practical solution, but I’ll note that OCaml itself uses ocamllex + menhir. (Until recently it used ocamlyacc.)
I agree that parsing is a small part of the task of writing a compiler, but it’s a big topic in general, and rather important to computer security (see the work done by the LangSec community in recent years.)
I think academics like teaching parsing a lot because it was one of the first places where pure theory managed to produce results that heavily improved the state of the art. The theory of regular and context free grammars, the various algorithms for dealing with parsing them, etc., are relatively pretty, and so they’re like catnip to an instructor.
That said, parsing remains a major problem. As @dbuenzli indicated, it’s pretty recent (with tools like Menhir) that one could get good error messages out of an LR parser generator created parser, and the whole area is still ripe for research.
And that said, I’ll agree with @Drup that even if parsing is going to remain a big research area for a long time, it’s only a small fraction of the whole compiler problem; I think you could spend a whole year-long course just on optimization and still not touch on everything of importance.
Here’s something I wrote using the Dragon Book and Standard ML yonks ago. A good exercise would be to translate it to OCaml, clean it up and use more library (standard or Base) functions.
I also took that exact course you reference last year (and did that exercise). Maybe wait a bit (a long bit?) before trying it if you’re just starting out with OCaml.