How severe is the Menhir code size penalty for having multiple start symbols?

TheAspiringHacker · April 9, 2019, 5:00pm

The LR automaton generation algorithm creates “item sets” based on the nonterminal to be parsed. These item sets eventually become the states of the automaton for parsing the nonterminal. So, if there are multiple start symbols, each start symbol needs its own automaton and has its own item sets. As the number of states in the LR automaton for a practical grammar can be in the hundreds, I am worried that having multiple start symbols can result in significant code size.

I copied https://gitlab.inria.fr/fpottier/menhir/blob/master/demos/calc-dune/parser.mly and found out that the size of the generated .ml file was 19250. Then, I added %start<int> expr and generated the file again, and it was 24322 bytes. Should I be worried about the code size when generating parsers for multiple nonterminals in a grammar? My use case involves compiling the code to JavaScript such that it gets downloaded by the browser when the user visits a page.

Contrast parser generators with handwritten parsers, where each nonterminal has its own function and the functions can be reused. Would handwritten parsers be more appropriate when I have multiple entry points?

EDIT: I tested Menhir with https://gitlab.com/emelle/emelle/blob/b8b2edb56da5dd0730d88587722ff872ce9fb2e8/src/syntax/parser.mly by generating the .ml file with all the start symbols versus only with the file start symbol, and the sizes of the files were 254972 versus 241122, so as the grammar size grows larger, the code size penalty doesn’t seem to be that severe…

Drup · April 9, 2019, 6:08pm

I can’t completely answer your question, but note that you could try to use the --lalr option, which restrict menhir to LALR grammars and thus reduces the size of the automaton significantly.

Leandros · April 11, 2019, 3:43pm

I have a fairly large grammar in menhir. I’ve tested it a couple configurations.

`--lalr` / 1 Entry	`--canonical` / 1 Entry	`--lalr` / 2 Entries	`--canonical` / 2 Entries
4355289	31289026	4374858	32607871

I’m looking at the .automaton and taking the size in bytes. In canonical mode, my grammar has 8.8k states. In LALR mode a measly 900.
This is with the table based grammar.
I’ve also tried to omit the --lalr flag, but the results didn’t change.

let-def · April 12, 2019, 12:32pm

States are shared between the different entry points, so adding entries won’t grow the automaton size significantly (unless large parts of the grammar are completely unreacheable from some entry points, but you should get warnings about that).

The size of the automaton is unrelated to the backend in use, however it is affected by the algorithm (default LR(1) with merging, LALR, canonical).

LR(1) should do well, I don’t think you should worry about that.

Topic		Replies	Views
Gynormous memory consumption to compile output of Menhir Ecosystem menhir	6	626	April 25, 2023
Help wanted understanding Menhir parser Learning	8	3400	September 16, 2018
[ANN] New release of Menhir (20230608) Community announce	0	488	June 14, 2023
[ANN] New release of Menhir Community announce	0	1236	December 7, 2017
[ANN] New release of Menhir (20211230) Community announce	0	1220	December 31, 2021

How severe is the Menhir code size penalty for having multiple start symbols?

Related topics