I have always been fascinated by LR parsing. The concept is amazing. You write a
context free grammar and there is a yacc like parser generator which transforms
the grammar into a program which does a table driven bottoms up parsing. The
parser generator accepts all types of recursion and converts it to a parsing
table.
In terms of performance table driven parsers are unbeatable. Therefore whenever
I had to parse some syntactic constructs in a programming language I searched
for a yacc like parser generator.
The same happened as I started to program in ocaml. In the standard library
there is a module called ocamlyacc
which is more or less compatible with the
standard yacc
for c. And there is this excellent tool called menhir
which
gives a lot of help in writing context free grammars correctly.
I have never considered top down parsing like parsing expression grammars which
are used in combinator parsing libraries. The avoidance of left recursion seemed
to be unnatural.
But every time using LR parsing I had to fight against shift reduce or even
worse reduce reduce conflicts. For simple grammars the conflicts could be
resolved easily. But for more complicated grammars the resolution of the
conflicts took a significant effort.
Furthermore I wanted to be able to run my parsers within the browser. Since
nearly all parser generators are designed to read their input from a file
(usually via some lexer) their usuage within a browser required some complicated
workarounds.
This made me look into combinator parsing. I have looked into the most prominent
combinator parser parsec
and I first discovered two things:
-
The avoidance of left recursion is never a problem, because there are
combinators which allow the parsing of zero or more occurrences of a construct
easily. -
The biased choice construct avoids the fight against shift reduce or reduce
reduce conflicts.
Then I discovered this <?>
combinator which is a great tool in writing some
comprehensible syntax error messages.
Moreover writing combinator parsers is programming in your functional
programming language. No 2 pass compiling, just compile your functions.
Unfortunately Haskell is not my implementation language. I wanted to implement
my parsers in ocaml.
As the appetite for combinator parsing was growing, my requirements were growing
as well. I wanted a parser satisfying the following requirements:
-
Inversion of control: I.e. the parser should not control the input of the
token. The token should be pushed into the parser. This requirement allows the
usage of the parser in any environment. In the browser I just get string and
can push the string content into the parser. In a console program a read the
file content or the output of another program or the content of a tcp stream
chunk by chunk and push the chunk content into the parser. -
Access to the state at any time: It should be possible to ask the parser at
any time: Has a syntax error occurred? How many declarations have already been parsed successfully? At which point of the input are you currently? -
Fully functional: Needed for incremental parsing. E.g. the user types the
program into a browser window. The parsing state can be stored at each line or
at each paragraph. If the user changes part of the text, a parser stored at
some previous checkpoint can be used to rerun the parsing without reparsing
the whole content. -
Indentation sensitivity: This was one of the hardest requirements. But modern
programming languages tend to be indentation sensitive as can be seen from
Haskell or Python or the overwhelming success of yaml as a configuration
language. Hierarchical structures are most readable for humans if presented as
an indented structure and not with begin end or similar syntactic sugar.Fortunately I have found a good paper about layout parsing which is
not too difficult to implement for combinator parsing.
I have started to write combinator parser for my project. There has been some
difficulties, but it worked out fine.
Since the design of the parser is generic and has nothing to do with my specific
project, I have unbundled the parser from my project and released within a
library called fmlib which can be installed via opam and used in any
other project.