The LR automaton construction algorithm builds “item sets” starting from the nonterminal to be parsed; these item sets become the states of the automaton for that nonterminal. So, if a grammar has multiple start symbols, each start symbol gets its own automaton with its own item sets. Since the LR automaton for a practical grammar can have hundreds of states, I am worried that multiple start symbols could cause a significant increase in code size.
I copied https://gitlab.inria.fr/fpottier/menhir/blob/master/demos/calc-dune/parser.mly and found that the generated .ml file was 19250 bytes. Then I added %start<int> expr
and regenerated the file; it was 24322 bytes. Should I be worried about code size when generating parsers for multiple nonterminals in a grammar? My use case involves compiling the parser to JavaScript, so it gets downloaded by the browser when a user visits a page.
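For context, the experiment amounts to declaring a second entry point in the .mly file. A sketch, assuming the demo's existing start symbol is named main:

```
%start <int> main   (* the demo's existing entry point *)
%start <int> expr   (* added entry point: expr now gets its own automaton *)
```

Each %start declaration also exposes a corresponding parsing function in the generated module, e.g. Parser.main and Parser.expr.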
Contrast this with handwritten parsers, where each nonterminal has its own function and those functions can be reused across entry points. Would a handwritten parser be more appropriate when I have multiple entry points?
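To illustrate the reuse I mean, here is a minimal recursive-descent sketch (hypothetical token type and toy grammar, not from the demo) where two entry points share the same parsing function rather than duplicating any state machinery:

```ocaml
(* Toy grammar: expr := term ('+' term)* ; term := INT *)
type token = INT of int | PLUS | EOF

(* One function per nonterminal; both entry points below reuse expr. *)
let rec expr toks =
  let v, rest = term toks in
  match rest with
  | PLUS :: rest' ->
      let v', rest'' = expr rest' in
      (v + v', rest'')
  | _ -> (v, rest)

and term = function
  | INT n :: rest -> (n, rest)
  | _ -> failwith "expected integer"

(* Two entry points sharing the same parsing functions: *)
let parse_expr toks = fst (expr toks)   (* parse a bare expression *)
let parse_file toks =                   (* parse an expression up to EOF *)
  match expr toks with
  | v, [ EOF ] -> v
  | _ -> failwith "trailing tokens"
```

Adding another entry point here costs a few lines, not another table of states.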
EDIT: I tested Menhir with https://gitlab.com/emelle/emelle/blob/b8b2edb56da5dd0730d88587722ff872ce9fb2e8/src/syntax/parser.mly by generating the .ml file once with all the start symbols and once with only the file start symbol. The sizes were 254972 versus 241122 bytes, so as the grammar grows larger, the code-size penalty doesn’t seem that severe…