This is not an OCaml question; it is a design question about writing a parser grammar. Many people in the OCaml community write parsers, though, so I hope it is appropriate to ask here.
I forked someone’s tree-sitter grammar for Org, a markup language. The language has some commands relating to calendar events, for example:
SCHEDULED: <2024-09-28 Sun>
DEADLINE: <2024-10-05 Sun>
CLOSED: <2024-10-05 Sun>
These three keywords are defined in the language documentation.
The grammar I was looking at parses this and returns a pair `(keyword, date)`, where `keyword` is the string before the colon and `date` is the date that follows it. The parser accepts any string of alphabetical characters as the keyword.
As far as I’m concerned, at some point in the business logic I have to do a case analysis on whether the keyword is `SCHEDULED`, `DEADLINE` or `CLOSED`, since these have different semantic meanings. I can either put that logic in my own code or modify the parser to do it. I have been following the rule of thumb that “Any logic that can go in the parser is, by definition, parsing logic, so it should go in the parser.” So I want to move as much functionality as possible into the tree-sitter grammar itself. In this case, that would mean adding the constant keywords “SCHEDULED”, “DEADLINE” and “CLOSED” to the grammar and having the data structure returned by the parser distinguish between the three at the level of node types.
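To make the first option concrete, here is a minimal OCaml sketch of the case analysis done in application code, assuming the parser hands back the keyword as a plain string. The names `planning` and `classify` are hypothetical and are not part of tree-sitter or of the Org grammar.

```ocaml
(* Hypothetical type: illustrative only, not taken from tree-sitter
   or from the Org grammar. *)
type planning =
  | Scheduled
  | Deadline
  | Closed

(* Case analysis done in application code. The grammar accepts any
   alphabetical keyword, so unknown keywords must be handled here. *)
let classify (keyword : string) : planning option =
  match keyword with
  | "SCHEDULED" -> Some Scheduled
  | "DEADLINE" -> Some Deadline
  | "CLOSED" -> Some Closed
  | _ -> None
```

If the grammar produced a distinct node type per keyword instead, the `_ -> None` branch would no longer live in application code; an unrecognized keyword would be handled by the grammar itself.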
Is “Any logic that can go in the parser should go in the parser” a good rule of thumb, or is this a mistake?
I note that in this particular case:
- Instead of modifying the grammar, I could define three different parser queries, each with one particular keyword string hardcoded into it, and then use those queries in many places.
- The culture of “hackability” around Emacs encourages extensibility and flexibility. Here, letting the keyword be an arbitrary alphabetical string allows users to augment the language with additional keywords as they see fit.
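One possible middle ground between the two points above, sketched in OCaml under the assumption that the grammar stays permissive: the application gives the three documented keywords their own constructors and keeps everything else as a user-defined keyword. The names `keyword`, `Custom`, `keyword_of_string` and `describe` are hypothetical.

```ocaml
(* Hypothetical type: the three documented keywords get dedicated
   constructors, and any other keyword is preserved as-is. *)
type keyword =
  | Scheduled
  | Deadline
  | Closed
  | Custom of string

(* Total conversion: no input is rejected, so user-defined keywords
   survive and can be interpreted later. *)
let keyword_of_string (s : string) : keyword =
  match s with
  | "SCHEDULED" -> Scheduled
  | "DEADLINE" -> Deadline
  | "CLOSED" -> Closed
  | other -> Custom other

(* Downstream code still gets exhaustive pattern matching. *)
let describe (k : keyword) : string =
  match k with
  | Scheduled -> "scheduled"
  | Deadline -> "deadline"
  | Closed -> "closed"
  | Custom name -> "user-defined keyword " ^ name
```

This keeps the extensibility of an open keyword set while still letting the compiler check that every known case is handled.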