TL;DR Parser combinators appear to me to be “combinators for LL(1) parsing”. That leaves to the user all the work of converting a pile of regexps into a backtracking-free nest of combinator functions. And it isn’t at all clear that performance will be better.
I don’t really know. But I doubt it. Here’s why:
- the traditional divide between lexing and parsing concentrates the need for backtracking (to deal with ambiguity) into the lexer. There, it is dealt with by precomputing the lexing automaton, so that at runtime there is no backtracking.
- parser combinators either just use backtracking, or else require the user to do the work of converting a collection of regexps that is not LL(1) (in the sense that a single character of input does not suffice to choose which alternative to take) into a combinator nest that is backtracking-free.
- A good example that I experienced recently was the Camlp5 “pa_lexer” syntax extension: it is designed to write lexers, but without backtracking and also without the lookahead-without-consuming capability of lex. It’s significantly more complicated to write a complex lexing specification in that language than in plain old lex.
- It isn’t clear that performance would be better with parser combinators, but hey, maybe it would. I searched around for performance comparisons, and what I found was:
  - comparisons, on simple examples, between combinators and other methods, but not against regexps or lex
  - after that, only comparisons between different combinator libraries.
- I’ve experimented with solving a nontrivial parsing problem [one posed here by @rwmjones – it was a problem around parsing a syntax for expressing disk-contents] using Camlp5’s stream parsers, and eventually realized that the fastest, fastest, FASTEST technique (which was also the simplest) was to build a big-ass regexp, hand that to pcre, and let it do its magic. He started with a Perl script and wanted something faster. When it turned out that almost all the work could be done in a big-ass regexp, it seemed pretty clear to me that while, sure, OCaml was nice, if it were me I’d just implement that regexp in Perl and take the win. B/c typically Perl is pretty fast at string-processing, esp. when most of the work is in regexps.
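To make the point about precomputed lexing automata concrete, here is a minimal hand-rolled sketch (my own toy, not output of lex) of what “precompute the automaton” buys you. The token set and transition function are invented for illustration; the point is that the scanner only ever steps the automaton on the current character and remembers the last accepting state it passed (maximal munch), so it never re-reads input:

```ocaml
(* Toy DFA lexer: distinguishes "=", "==", and identifiers with no
   runtime backtracking. The transition function plays the role of the
   table a real lexer generator would precompute. *)

type token = Eq | EqEq | Ident of string

let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

(* States: 0 = start, 1 = after '=', 2 = after "==", 3 = in identifier. *)
let step state c =
  match state, c with
  | 0, '=' -> Some 1
  | 1, '=' -> Some 2
  | 0, c when is_alpha c -> Some 3
  | 3, c when is_alpha c -> Some 3
  | _ -> None

let accept state lexeme =
  match state with
  | 1 -> Some Eq
  | 2 -> Some EqEq
  | 3 -> Some (Ident lexeme)
  | _ -> None

(* Scan one token starting at [pos]: step the DFA forward, recording the
   most recent accepting state; when the DFA blocks, emit that token. *)
let scan s pos =
  let n = String.length s in
  let rec go state i last =
    let last =
      match accept state (String.sub s pos (i - pos)) with
      | Some t -> Some (t, i)
      | None -> last
    in
    if i >= n then last
    else match step state s.[i] with
      | Some state' -> go state' (i + 1) last
      | None -> last
  in
  go 0 pos None

let rec tokens s pos =
  if pos >= String.length s then []
  else match scan s pos with
    | Some (t, pos') -> t :: tokens s pos'
    | None -> failwith "lexing error"
```

On `"a==b"` this yields `[Ident "a"; EqEq; Ident "b"]` — note that `==` wins over `=` purely by the longest-accept rule, with no rewinding.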
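And the work I’m claiming parser combinators push onto the user — turning a non-LL(1) collection of alternatives into a backtracking-free nest — looks like this in miniature. The combinator type below is a toy of my own, not any particular library: its `alt` commits on one character of lookahead and never rewinds, which is exactly why the grammar has to be left-factored first:

```ocaml
(* A parser takes a string and a position, and either fails or returns a
   result plus the new position. No backtracking anywhere below. *)
type 'a parser = string -> int -> ('a * int) option

let char c : char parser = fun s i ->
  if i < String.length s && s.[i] = c then Some (c, i + 1) else None

let seq (p : 'a parser) (q : 'b parser) : ('a * 'b) parser = fun s i ->
  match p s i with
  | Some (a, j) ->
      (match q s j with
       | Some (b, k) -> Some ((a, b), k)
       | None -> None)
  | None -> None

(* Predictive choice: pick a branch from ONE character of lookahead,
   then commit -- if the branch fails, we do not try the other one. *)
let alt (pred : char -> bool) (p : 'a parser) (q : 'a parser) : 'a parser =
  fun s i ->
    if i < String.length s && pred s.[i] then p s i else q s i

(* "ab" | "ac" is not LL(1) as written (both alternatives start with 'a');
   left-factored into a ('b' | 'c'), it is: *)
let ab_or_ac : (char * char) parser =
  seq (char 'a') (alt (fun c -> c = 'b') (char 'b') (char 'c'))
```

Doing this left-factoring by hand across a whole language’s token set is precisely the work that lex’s automaton construction does for you.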
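For flavor, the big-regexp approach has this shape. The line format below is invented (it is not the actual disk-contents syntax from that thread), and I’m using the stdlib `Str` module as a stand-in for pcre; the point is that one regexp with capture groups does the entire parse in a single call:

```ocaml
(* Requires the "str" library at link time.
   Hypothetical "path size offset" line format, for illustration only. *)

let line_re =
  Str.regexp {|^\([A-Za-z0-9_/.-]+\) +\([0-9]+\) +\([0-9]+\)$|}

(* One regexp match, three capture groups, done. *)
let parse_line s =
  if Str.string_match line_re s 0 then
    Some (Str.matched_group 1 s,
          int_of_string (Str.matched_group 2 s),
          int_of_string (Str.matched_group 3 s))
  else None
```

So `parse_line "/dev/sda1 4096 0"` yields `Some ("/dev/sda1", 4096, 0)`, and a malformed line just yields `None`.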
It’s very difficult to assess whether parser combinators are suitable for actual “industrial strength” parsing, where that is defined as “the incumbent is C++ flex/bison”. Not to speak of more esoteric approaches in C/C++.
Overall, from what I can see, parser combinators are just recursive-descent parsing dressed up in combinators. That’s not bad: I’m a ginormous fan of Camlp5’s stream parsers. But I wouldn’t use them for writing a lexer, b/c boy howdy, that’s unpleasant work compared to writing a pile of regexps and stepping back while the lex automaton algorithm does its work.
P.S. I hope it’s clear that I’m not saying parser combinators aren’t useful. I use them (in the form of Camlp5 stream parsers) all the time for parsing binary protocols, e.g. Java classfile format, CORBA IIOP, etc. A distinguishing feature of RPC protocols is that they are all LL(1) languages at the character level. So parser combinators (or Camlp5 stream parsers) are perfect for them.
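A tiny illustration of why that works: in a made-up one-byte-tag, one-byte-length TLV format (real formats like classfiles or IIOP are bigger but have the same shape), every decision is driven by the byte under the cursor — a tag or a length — so a stream parser never needs to rewind. Plain OCaml here, standing in for Camlp5 stream-parser syntax:

```ocaml
(* Hypothetical wire format: [tag:1][len:1][body:len].
   Tag 0x01 = big-endian unsigned int, tag 0x02 = string. *)

type value = Int of int | Str of string

(* Parse one item at [pos]; return the value and the position after it.
   Every branch is decided by one byte of lookahead: pure LL(1). *)
let parse_tlv (b : bytes) (pos : int) : value * int =
  let tag = Char.code (Bytes.get b pos) in
  let len = Char.code (Bytes.get b (pos + 1)) in
  let body = Bytes.sub_string b (pos + 2) len in
  let v =
    match tag with
    | 0x01 ->
        let n = ref 0 in
        String.iter (fun c -> n := !n * 256 + Char.code c) body;
        Int !n
    | 0x02 -> Str body
    | t -> failwith (Printf.sprintf "unknown tag 0x%02x" t)
  in
  (v, pos + 2 + len)
```

So `parse_tlv (Bytes.of_string "\x01\x02\x01\x00") 0` yields `(Int 256, 4)` — no ambiguity, no lookahead beyond the current byte, which is why these protocols are such a comfortable fit for stream parsers.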