Why a handwritten parser?

Continuing the discussion from Is rescript damaging Ocaml ecosystem?:

Here’s a thread with more discussion points on parser generators vs hand-written recursive descent parsers: Show HN: How to write a recursive descent parser | Hacker News

Here’s Niklaus Wirth on it:

As it happens with new fields of endeavor, research went rather beyond the needs of the first hour. More and more powerful parser generators were developed, which managed to handle ever and ever more general and complex grammars. Albeit this was an intellectual achievement, the consequence was less positive. It led language designers to believe that no matter what syntactic construct they postulated, automatic tools could surely detect ambiguities, and some powerful parser would certainly cope with it. A misled idea! No such tool would give any indication how that syntax could be improved. Not only had designers ignored the issue of efficiency, but also the fact that a language serves the human reader, and not only the automatic parser. If a language poses difficulties to parsers, it surely does so for the human reader too. Many languages would be clearer and cleaner, had their designers been forced to use a simple parsing method.

I can strongly support this statement by my own experiences. After having contributed in the 1960s to the development of parsers for precedence grammars, and having used them for the implementation of the languages Euler and Algol W, I decided to switch to the simplest parsing method for Pascal, the method of top-down, recursive descent. The experience was most encouraging, and I stuck to it up to this day with great satisfaction.

6 Likes

Does anyone know whether there has been any experiment in writing a recursive-descent parser for the standard OCaml grammar? I guess camlp{4,5} have one, but it would be hard to extract…

Cheers,
Nicolas

This is a really fraught area. I myself am a confirmed fan of LL(1) grammars, because they’re extensible, correspond to my intuition about languages, and during language development error detection is a local phenomenon. But OTOH:

  1. LL(1) isn’t sufficient to cleanly express a grammar for the modern OCaml language – there are quite a few spots where one has to use >1-token lookahead (a sketch of the general shape of the problem follows this list). Manifestly LALR(1) -is- enough to express a grammar for that language.
  2. A lot of the grief LALR(1) gets is (I suspect) due to the fact that the grammar-compilation algorithm is global, so its errors are … impenetrable. You get a conflict, and it doesn’t really tell you where or why the error happened. To figure it out, you have to re-run the LR algorithm and see where things went wrong. But this is remediable – modern versions of Menhir, I have been told, do a much, much better job of leading you to the actual rules that conflict, and of explaining why the conflict arose.
  3. Predictive (LL(1)) parsers give “better” errors when confronted with bad input, but this, again, is a place where systems like Menhir have made great progress. I don’t know whether bison has made such progress, and maybe that’s why people rewrote the C++ parser from bison to recursive descent. If that’s the case, it’s a real pity.
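
To illustrate point 1, here’s a toy sketch (hypothetical tokens and a made-up two-production grammar, not a fragment of the real OCaml grammar): when two productions share their first token, a strict LL(1) parser can’t choose between them on one token of lookahead, but a hand-written parser just peeks at the second token, or equivalently left-factors by consuming the shared token before deciding.

    (* Hypothetical tokens and a two-production toy grammar -- not the real
       OCaml grammar.  Both productions begin with ID, so one token of
       lookahead cannot choose between them; a hand-written parser inspects
       the second token (equivalently, it left-factors: read the ID, then
       decide). *)
    type token = ID of string | ASSIGN | LPAREN | RPAREN | INT of int

    type stmt = Assign of string * int | Call of string * int

    let parse_stmt = function
      | ID x :: ASSIGN :: INT n :: rest -> (Assign (x, n), rest)
      | ID x :: LPAREN :: INT n :: RPAREN :: rest -> (Call (x, n), rest)
      | _ -> failwith "syntax error"

An LALR(1) generator doesn’t need this dodge, since the decision is deferred until after the shared prefix has been read.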

I’m a confirmed fan of LL(1) [and, someday, LL(k)]. But even I can see that many of the knocks that LALR(1) parser-generators get are based on older implementations.

1 Like

As the person who might have started this very flammable discussion, I want to mention that:

  1. I have a feeling that table-based parsers should show better performance, because they are essentially a DSL optimized for parsing.
  2. I’m more interested in the parser-combinator approach than in a hand-written recursive descent parser, because I believe the amount of work needed to maintain the latter would be quite significant. Yes, the performance of parser combinators can be not so great, but maybe there are silver bullets like Staged Selective Parser Combinators, which can reduce the margin to less than 20%.

I’m curious: how do you implement things like left-factorization with parser-combinators? That is, how do you -express- a grammar in a manner that is ambiguous (but that, when left-factorized, is no longer ambiguous) using parser combinators?
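
To make the question concrete, here is the sort of thing I mean, with toy hand-rolled combinators over a token list (made-up tokens, no particular library assumed). The unfactored form of the grammar only works because the choice operator backtracks; the factored form has to be written out by hand:

    (* Toy combinators; the grammar is  expr ::= num "+" expr | num. *)
    type token = INT of int | PLUS
    type ast = Num of int | Add of ast * ast

    type 'a parser = token list -> ('a * token list) option

    let return x = fun ts -> Some (x, ts)
    let ( >>= ) p f = fun ts ->
      match p ts with None -> None | Some (x, rest) -> f x rest
    (* backtracking choice: if p fails, q is retried on the original input *)
    let ( <|> ) p q = fun ts -> match p ts with None -> q ts | r -> r

    let num : ast parser = function INT n :: ts -> Some (Num n, ts) | _ -> None
    let plus : unit parser = function PLUS :: ts -> Some ((), ts) | _ -> None

    (* Unfactored: both alternatives start with num.  This works only because
       <|> backtracks, and the shared prefix is parsed twice on the fallback. *)
    let rec expr_unfactored ts =
      ((num >>= fun t -> plus >>= fun _ -> expr_unfactored >>= fun e ->
          return (Add (t, e)))
       <|> num) ts

    (* Left-factored by hand: parse the common prefix once, then choose. *)
    let rec expr ts =
      (num >>= fun t ->
         ((plus >>= fun _ -> expr >>= fun e -> return (Add (t, e)))
          <|> return t)) ts

Nothing in the combinator code flags the overlap between the two branches; the factoring has to be done (and maintained) by hand, which is what prompts the question.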

Maybe GLL is the way to go? From what I’ve been following, it seems like the holy grail.

A quick google tells me that it supports all context-free grammars. Almost by definition that’s the wrong thing to do, I fear. You don’t want multiple parse-trees, nor even multiple derivations coexisting during the parse, only for one to emerge at the end.

ETA: I mentioned LL(k). What I specifically wonder about is ANTLR. I haven’t used it, but it seems like the right approach, and at some point I’m going to have to try it out.

I guess that since the goal of using parser combinators is simplicity (no alternative syntax, no preprocessing), factorization is either a manual optimization left to the user, or requires compiler optimizations that are rarely found outside very experimental programming languages (the compiler has to discover a lot of properties of the combinator program). In practice, the usual choice is to build a combinator library that directly accepts some ambiguous, left-recursive grammars (via memoization or something similar), instead of trying to factorize.

1 Like

Um … it seems to me that if we’re going to say that left-factorization is something we can’t expect from our parser support system, then … why should we expect our match-compiler to deal with things like

match x with A (B _, C _) -> e1 | A (D _, E _) -> e2 | F _ -> e3

by “factorizing” out the “A” constructor?

ETA: Camlp5’s extensible grammars and stream-parsers have always (AFAIK) supported left-factorization. The extensible Coq V5.10 parser I wrote did so (at least, IIRC) 27 years ago, and I got the idea from Luca Cardelli’s extensible syntax for his FOmega-based language implementation. This isn’t esoteric stuff.

For a side project I needed to parse a C-like language that was already defined with an EBNF grammar. I started out using Angstrom to write a recursive descent parser, and after hitting a few bugs I decided to use Menhir and ocamllex instead. I figured that it would be fairly trivial to convert the EBNF grammar into Menhir productions, and I would have more confidence that the result was correct.

That turned out to be true for the most part, and overall I’m happy with the result. Menhir did end up being a significant time investment to learn, and I will need to invest more time to get better error messages. But you only need to learn it once, right?
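
For a flavour of what that conversion looks like, here is a hypothetical fragment (not my actual grammar): an EBNF repetition turns into a left-recursive Menhir production, which is exactly the kind of rule LR parsing handles directly.

    /* Hypothetical sketch: the EBNF rule
         expr ::= term { "+" term }
       written as a left-recursive Menhir production. */
    %token <int> INT
    %token PLUS EOF
    %start <int> main
    %%
    main:
      | e = expr; EOF { e }
    expr:
      | e = expr; PLUS; t = term { e + t }
      | t = term { t }
    term:
      | n = INT { n }

The left recursion is fine for Menhir; with a recursive descent or combinator parser I would have had to rewrite it as iteration instead.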

Sorry, I’m only a moderate-level user of parsing libraries. I did some reading of Meijer’s Monadic Parser Combinators

And confirmed that the chainl combinator defined in section 4.3 (which handles left recursion) eliminates the need for left factorization, according to section 5.1.
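
For reference, here is roughly what chainl1 looks like transliterated into OCaml over a toy token-list parser type (a sketch, not the paper’s code and not any particular library):

    (* Toy parser type and combinators, just to show the shape of chainl1. *)
    type token = INT of int | PLUS | MINUS
    type 'a parser = token list -> ('a * token list) option

    let return x = fun ts -> Some (x, ts)
    let ( >>= ) p f = fun ts ->
      match p ts with None -> None | Some (x, rest) -> f x rest
    let ( <|> ) p q = fun ts -> match p ts with None -> q ts | r -> r

    (* chainl1 p op: one or more p's separated by op, combined
       left-associatively -- the usual replacement for the left-recursive
       rule  expr ::= expr op p | p. *)
    let chainl1 p op =
      p >>= fun first ->
      let rec rest acc =
        (op >>= fun f -> p >>= fun b -> rest (f acc b)) <|> return acc
      in
      rest first

    (* Example: integers with + and -, associating to the left. *)
    let int_ : int parser = function INT n :: ts -> Some (n, ts) | _ -> None
    let add_op : (int -> int -> int) parser = function
      | PLUS :: ts -> Some (( + ), ts)
      | MINUS :: ts -> Some (( - ), ts)
      | _ -> None

    let expr : int parser = chainl1 int_ add_op
    (* expr [INT 10; MINUS; INT 3; MINUS; INT 2] = Some (5, []) *)

The left-associative result comes out without ever writing a left-recursive rule.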

1 Like

That’s an interesting find. I was taught compiler construction by one of his now-famous students. Parser generators were mentioned, but hand-written parsers were recommended.

That was twenty years ago :older_man: but AFAIR the two main reasons mentioned were the flexibility to devise good error reports (no Error: Syntax error :joy:) and the ability to recover from a parse error so as to report more (non-misleading) errors in a single parse.
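
To illustrate the second reason, here’s a toy sketch of that style of recovery (hypothetical tokens, nothing from a real compiler): on a syntax error, record a message, skip ahead to a synchronization token, and keep parsing, so several errors can be reported in one pass.

    (* Toy "recover and keep going" parser: the language is just a list of
       integers terminated by semicolons.  On an error we record a message,
       skip to the next SEMI, and resume. *)
    type token = INT of int | SEMI | EOF

    let parse_program tokens =
      let errors = ref [] in
      (* skip ahead to the next statement boundary *)
      let rec synchronize = function
        | SEMI :: rest -> rest
        | EOF :: _ | [] -> []
        | _ :: rest -> synchronize rest
      in
      let rec stmts acc = function
        | [] | EOF :: _ -> (List.rev acc, List.rev !errors)
        | INT n :: SEMI :: rest -> stmts (n :: acc) rest
        | rest ->
            errors := "expected an integer statement" :: !errors;
            stmts acc (synchronize rest)
      in
      stmts [] tokens
    (* parse_program [INT 1; SEMI; SEMI; INT 2; SEMI]
       = ([1; 2], ["expected an integer statement"]) *)

Real recovery logic is of course much more involved, but choosing where to synchronize and what to say is exactly the kind of freedom a hand-written parser gives you.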

At that point the Menhir people will chime in to tell me how wrong these ideas are :-) but somehow this had a durable effect on me, and I always write my parsers manually (modulo a past obsession with combinator parsers).

5 Likes