Can 'Error: Syntax error' be improved?

Eg:

# let class = 1;;
      ^^^^^
Error: Syntax error

How difficult would it be to add some explanation here, eg Error: Syntax error: expected a pattern, got 'class'?

1 Like

Not trivial: Research

Cheers,
Nicolas

1 Like

Thanks. Looking at: https://def.lakaban.net/research/2024-LRGrep-dissertation.pdf

It says:

However, only some rules can utilise the error symbol due to the rapid introduction of ambiguities. When multiple error clauses apply to a single situation, which one should prevail? As a result, most cases are not covered, and the parser defaults to a generic “Syntax error” message.

Well, what would happen if the first error clause was always chosen? For example, what would be reported for my code snippet given above?

There’s a different answer: predictive parsers do this, right? And LR parsers are not predictive.

Aside from engineering effort, is there anything standing in the way of OCaml using something more error message friendly like recursive descent instead of an LR parser?

If a file has the code:

let class = 1

And the parser reads the characters let , it knows that the next expected token must be a pattern, right? EDIT: or the keyword open :slight_smile:

The CAML system (which preceded Caml Light, which preceded OCaml) had a recursive descent parser; see the Appendix of the CAML reference manual.

In the paper that introduced the system that would become Caml Light and then OCaml, @xavierleroy had this to say about the parser (see Section 5.2.1, page 55 and 56):

Parsing ML is not easy: there is a huge number of possible ambiguities, since most constructs are “open” (i.e. without closing delimitor), and even more so if one wants to provide the same amount of syntactic sugar provided by CAML, for compatibility reasons. The conventional approach is to multiply the nonterminals of the grammar (e.g. having several entry points for expressions, corresponding to expressions with various priorities), in order to resolve these ambiguities in the grammar definition itself. This leads to very convoluted and hard to read grammars; the actual grammar of CAML [62, appendix Z] is a very convincing example of this tendency!

Yannick Martel and I dismissed this approach and chose instead to write a very natural grammar, but highly ambiguous, and rely on Yacc’s disambiguation mechanisms (precedences, associativities, etc.) to get correct parsing. The result is a concise grammar (about five times smaller than the corresponding fragment of CAML grammar), fairly easy to read (even for a non-specialist like me), but with some 50 reduce/reduce conflicts and 500 shift/reduce conflicts! As it is impossible to check each conflict to see if it was resolved correctly, it is not possible to know exactly what the resulting parser does, and getting a parser that works correctly on test CAML programs required quite a bit of trial and error… Yacc is partly responsible for this situation, since the conflict resport and disambiguation tools it provides are fairly crude. For instance, precedence and associativities are not taken into account to solve reduce/reduce conflicts. And to understand where the conflicts lie, one has to look at the generated automata, which requires training and fortitude. We conclude that common parser generator technology is not yet powerful enough to easily tackle languages with such a tricky syntax as CAML.

Since then, this decision has not been revisited in the main system as far as I know (other than to switch to Menhir a few years ago). The Camlp4 and Camlp5 systems had a reimplementation of the parser using recursive descent technology that was maintained in parallel, but they had their own complexity issues and they are not actively developed anymore.

In summary: writing a recursive descent parser for OCaml is technically possible, but a considerable amount of work and no-one has actually stepped forward to experiment with such an approach. Even if successful, the resulting parser is likely to be of considerable complexity and have the same drawbacks that motivated the switch to LR parsers in the first place… On the other hand, it could allow for better error reporting; but it is not clear if this would be enough to justify a switch back to recursive descent.

Cheers,
Nicolas

6 Likes

Moreover, any effort taken to write a recursive descent parser would be probably more efficiently spent taking the time to apply menhir and LRgrep improved error reporting to the existing parser. I know this is the todo list of few people (me at the very least).

2 Likes

I maintain the Camlp5 grammar that is the descendant of the LL parser you describe, and I can attest to the truth in your summary. And you’re 100% right, that the switch back to LL would not be justified.

What -might- justify such a switch, would be switching to using ANTLR, with much more lookahead. But that’d be an enormous change, not least since it would require either using ANTLR’s Java code, or porting that to OCaml.

Does the error report need to be absolutely perfect? Here’s a Java example:

// Test.java

class Test {
  int i;

Output:

$ javac Test.java
Test.java:2: error: reached end of file while parsing
  int i;
        ^
1 error

Obviously the error is the missing closing brace. But it just says ‘reached end of file’. But clearly it’s something and at least points you in the right direction.

My point is, is there a way with Menhir’s error reporting to just pick one of the ambiguous options, maybe the first one, and report it as the error? Surely reporting something is better than reporting nothing?