How severe is the Menhir code size penalty for having multiple start symbols?

The LR automaton generation algorithm creates “item sets” based on the nonterminal to be parsed. These item sets eventually become the states of the automaton for parsing the nonterminal. So, if there are multiple start symbols, each start symbol needs its own automaton and has its own item sets. As the number of states in the LR automaton for a practical grammar can be in the hundreds, I am worried that having multiple start symbols can result in significant code size.

I copied https://gitlab.inria.fr/fpottier/menhir/blob/master/demos/calc-dune/parser.mly and found out that the size of the generated .ml file was 19250. Then, I added %start<int> expr and generated the file again, and it was 24322 bytes. Should I be worried about the code size when generating parsers for multiple nonterminals in a grammar? My use case involves compiling the code to JavaScript such that it gets downloaded by the browser when the user visits a page.

Contrast parser generators with handwritten parsers, where each nonterminal has its own function and the functions can be reused. Would handwritten parsers be more appropriate when I have multiple entry points?

EDIT: I tested Menhir with https://gitlab.com/emelle/emelle/blob/b8b2edb56da5dd0730d88587722ff872ce9fb2e8/src/syntax/parser.mly by generating the .ml file with all the start symbols versus only with the file start symbol, and the sizes of the files were 254972 versus 241122, so as the grammar size grows larger, the code size penalty doesn’t seem to be that severe…

I can’t completely answer your question, but note that you could try to use the --lalr option, which restrict menhir to LALR grammars and thus reduces the size of the automaton significantly.

I have a fairly large grammar in menhir. I’ve tested it a couple configurations.

--lalr / 1 Entry --canonical / 1 Entry --lalr / 2 Entries --canonical / 2 Entries
4355289 31289026 4374858 32607871

I’m looking at the .automaton and taking the size in bytes. In canonical mode, my grammar has 8.8k states. In LALR mode a measly 900.
This is with the table based grammar.
I’ve also tried to omit the --lalr flag, but the results didn’t change.

States are shared between the different entry points, so adding entries won’t grow the automaton size significantly (unless large parts of the grammar are completely unreacheable from some entry points, but you should get warnings about that).

The size of the automaton is unrelated to the backend in use, however it is affected by the algorithm (default LR(1) with merging, LALR, canonical).

LR(1) should do well, I don’t think you should worry about that.

3 Likes