How to convert string to type "regexp"?

I got this type:
type regexp =
| V ( * void * )
| E ( * epsilon * )
| C of char ( * char * )
| U of regexp * regexp ( * a + b * )
| P of regexp * regexp ( * a.b * )
| S of regexp ( * a* *)
;;
How do I convert a string like this (a + b) * into an expression of type regexp S (U (C ‘a’, C ‘b’))?

You’ll need a parser (and perhaps a lexer). There are a bunch of ways to do that: you can use ocamlyacc/menhir (and ocamllex/sedlex), and then there are a bunch of parsing combinator libraries (IIRC, Angstrom is one). You could also write a recursive-descent parser by-hand (nobody does that anymore except in first-year programming classes grin).

If you’ve never written a parser before, I’d strongly suggest you back up and learn how to write one for something like a simple arithmetic-expression language (+, -, /, *, unary-minus) before approaching your problem, simply because for the arithmetic-expression language, it’ll be -easy- to discern when you’re making mistakes.

Also, that is typically the sort of toy problem that is presented in most tutorials for parser-generation tools (like yacc/menhir, angstrom, etc).

If you’ve never done this before, don’t be afraid: we all had to do it once, and it’s completely surmountable. There are many books on the subject: any decent compiler-construction book will have a chapter or two on lexing and parsing at the beginning, and it’s definitely worth finding one and reading/following that, also.

i think it is my case

Take heart. I have a friend who teaches an intro compiler course, and he told me once that he spends most of the time on parsing and lexing … and not on semantic analysis (e.g. typechecking) and code generation. Why? Because in their careers, every one of his students will end up designing and implementing “little languages” over-and-over-and-over. They will need to be very comfortable with parsing/lexing. But very few will have the privilege of writing a real compiler with type-checking and code-generation, or even modifying such a compiler. It’s just not that common a task for programmers, even though it’s an important one.

So he spends most of his energy making sure his students know how to do the tasks that they’re going to see in the real world.

ETA: concretely, I’m sure many people will have good suggestions for compiler books. I’m old enough that it was the “Dragon Book” (_Compilers: Principles, Techniques, and Tools" by Aho, and eventually a cast of other well-regarded CS profs). Really, any compiler book will do, b/c you’re wanting to learn the -beginning- of the book. I searched for one that presented the material in Ocaml, to no avail. But also, the part that you will need to learn, is completely independent of the programming language you choose to implement in: the technology of lexing and parsing is all implementation-language-agnostic. Hence, “ocamlyacc” takes its name from “yacc” which generated code for C. IIRC the original caml-light “camlyacc” actually invoked yacc under the covers to generate the parsing tables, but I could be mis-remembering (gettin’ old).

Take heart! You can learn this! And you’ll use it over and over throughout your career!

I once needed a concurrent parser combinator library that accepts network channels directly, but neither go nor erlang community seemed to have such one.

and by the way, a regexp engine with such capabilities was also created.

The 44 LOC function, str2reg, explains how:

and https://github.com/kandu/ok_parsec/blob/master/src/re.ml#L221 is

let make str= (dfa2sm (nfa2dfa (reg2nfa (str2reg str))))

illustrates the entire process on how a string is transformed, string -> internal regex representation -> nfa -> dfa-> state machine

And there is a test executable in the test directory. Invoke it, you will get outputted graphviz files of the intermediate nfa & dfa & state machine , from which svg images are also generated, and finally the test program will open the image for you.

1 Like