The LR automaton construction algorithm builds "item sets" starting from the symbol to be parsed. These item sets eventually become the states of the automaton for that symbol. So, if there are multiple start symbols, each start symbol needs its own entry into the automaton and contributes its own item sets. Since the LR automaton for a practical grammar can have hundreds of states, I am worried that having multiple start symbols could significantly increase the generated code size.
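For concreteness, here is roughly what multiple entry points look like in a Menhir grammar (a hypothetical fragment, not the demo grammar linked below): each `%start` declaration adds an entry point, and the generator must produce states reachable from each one.

```
%token <int> INT
%token PLUS EOF

%start <int> main       (* entry point 1: a whole input *)
%start <int> expr_only  (* entry point 2: just an expression *)

%%

main:      e = expr; EOF { e }
expr_only: e = expr      { e }

expr:
  | i = INT                  { i }
  | e1 = expr; PLUS; i = INT { e1 + i }
```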
I copied https://gitlab.inria.fr/fpottier/menhir/blob/master/demos/calc-dune/parser.mly and found out that the size of the generated .ml file was 19250. Then, I added
Contrast this with handwritten recursive-descent parsers, where each nonterminal is an ordinary function and those functions can be shared between entry points. Would a handwritten parser be more appropriate when I have multiple entry points?
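To illustrate the reuse I mean, here is a minimal hand-written recursive-descent sketch (the token type and grammar are hypothetical, not taken from the Menhir demos): both entry points at the bottom reuse the same nonterminal functions, with no per-entry-point tables.

```ocaml
(* Hypothetical token type for a toy additive grammar. *)
type token = INT of int | PLUS | LPAREN | RPAREN | EOF

exception Syntax_error

(* expr ::= atom (PLUS expr)?  -- each nonterminal is just a function
   over the remaining token list, returning a value and the rest. *)
let rec parse_expr toks =
  let v, toks = parse_atom toks in
  match toks with
  | PLUS :: rest ->
      let v', toks = parse_expr rest in
      (v + v', toks)
  | _ -> (v, toks)

(* atom ::= INT | LPAREN expr RPAREN *)
and parse_atom = function
  | INT n :: rest -> (n, rest)
  | LPAREN :: rest ->
      (match parse_expr rest with
       | v, RPAREN :: rest' -> (v, rest')
       | _ -> raise Syntax_error)
  | _ -> raise Syntax_error

(* Two entry points, both reusing the functions above. *)
let parse_whole_expr toks =
  match parse_expr toks with
  | v, [ EOF ] -> v
  | _ -> raise Syntax_error

let parse_parenthesized toks =
  match parse_atom toks with
  | v, [ EOF ] -> v
  | _ -> raise Syntax_error
```

Adding a third entry point here costs one small function, whereas in a table-driven LR parser it can mean additional reachable states.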
EDIT: I tested Menhir with https://gitlab.com/emelle/emelle/blob/b8b2edb56da5dd0730d88587722ff872ce9fb2e8/src/syntax/parser.mly by generating the .ml file once with all of the start symbols and once with only the `file` start symbol. The file sizes were 254972 versus 241122 bytes, so as the grammar grows larger, the code size penalty doesn't seem to be that severe…