My way from LALR parsing to combinator parsing

I have always been fascinated by LR parsing. The concept is amazing. You write a
context free grammar and there is a yacc like parser generator which transforms
the grammar into a program which does a table driven bottoms up parsing. The
parser generator accepts all types of recursion and converts it to a parsing
table.

In terms of performance table driven parsers are unbeatable. Therefore whenever
I had to parse some syntactic constructs in a programming language I searched
for a yacc like parser generator.

The same happened as I started to program in ocaml. In the standard library
there is a module called ocamlyacc which is more or less compatible with the
standard yacc for c. And there is this excellent tool called menhir which
gives a lot of help in writing context free grammars correctly.

I have never considered top down parsing like parsing expression grammars which
are used in combinator parsing libraries. The avoidance of left recursion seemed
to be unnatural.

But every time using LR parsing I had to fight against shift reduce or even
worse reduce reduce conflicts. For simple grammars the conflicts could be
resolved easily. But for more complicated grammars the resolution of the
conflicts took a significant effort.

Furthermore I wanted to be able to run my parsers within the browser. Since
nearly all parser generators are designed to read their input from a file
(usually via some lexer) their usuage within a browser required some complicated
workarounds.

This made me look into combinator parsing. I have looked into the most prominent
combinator parser parsec and I first discovered two things:

  • The avoidance of left recursion is never a problem, because there are
    combinators which allow the parsing of zero or more occurrences of a construct
    easily.

  • The biased choice construct avoids the fight against shift reduce or reduce
    reduce conflicts.

Then I discovered this <?> combinator which is a great tool in writing some
comprehensible syntax error messages.

Moreover writing combinator parsers is programming in your functional
programming language. No 2 pass compiling, just compile your functions.

Unfortunately Haskell is not my implementation language. I wanted to implement
my parsers in ocaml.

As the appetite for combinator parsing was growing, my requirements were growing
as well. I wanted a parser satisfying the following requirements:

  • Inversion of control: I.e. the parser should not control the input of the
    token. The token should be pushed into the parser. This requirement allows the
    usage of the parser in any environment. In the browser I just get string and
    can push the string content into the parser. In a console program a read the
    file content or the output of another program or the content of a tcp stream
    chunk by chunk and push the chunk content into the parser.

  • Access to the state at any time: It should be possible to ask the parser at
    any time: Has a syntax error occurred? How many declarations have already been parsed successfully? At which point of the input are you currently?

  • Fully functional: Needed for incremental parsing. E.g. the user types the
    program into a browser window. The parsing state can be stored at each line or
    at each paragraph. If the user changes part of the text, a parser stored at
    some previous checkpoint can be used to rerun the parsing without reparsing
    the whole content.

  • Indentation sensitivity: This was one of the hardest requirements. But modern
    programming languages tend to be indentation sensitive as can be seen from
    Haskell or Python or the overwhelming success of yaml as a configuration
    language. Hierarchical structures are most readable for humans if presented as
    an indented structure and not with begin end or similar syntactic sugar.

    Fortunately I have found a good paper about layout parsing which is
    not too difficult to implement for combinator parsing.

I have started to write combinator parser for my project. There has been some
difficulties, but it worked out fine.

Since the design of the parser is generic and has nothing to do with my specific
project, I have unbundled the parser from my project and released within a
library called fmlib which can be installed via opam and used in any
other project.

10 Likes

Nice specification, and even a proposed solution near the end!

I was curious of what is available in opam already:

# opam search parse | grep -i comb
angstrom                           --          Parser combinators built for speed and memory-efficiency
asn1-combinators                   --          Combinators for expressing ASN.1 grammars in OCaml
ezxmlm                             --          Combinators for parsing and selection of XML structures
mparser                            --          A simple monadic parser combinator library
opal                               --          Self-contained monadic parser combinators for OCaml
ostap                              --          Parser-combinator library
pacomb                             --          Parsing library based on combinators and ppx extension to write languages
planck                             --          A small monadic parser combinator library
transept                           --          Generalized parser combinator library
3 Likes

See the documentation of angstrom to find a comparison of the features of some of them.

There are a lot, because combinator parsers are easy to write and easy to use. Much easier to write than LALR parsers.

For my purposes the most important requirements of being incremental (inversion of control) and indentation sensitive are to the best of my knowledge not satisfied by the available parsers. Therefore I wrote my own one. Before writing it I have looked into some of them.

2 Likes

Zero or more of an expression e+
One or more of an expression e*

(from https://hbr.github.io/fmlib/odoc/fmlib/Fmlib_parse/index.html)

Might be a typo, I used to see * for zero or more and + for one or more.

Thanks for the hint. It is a typo. * is zero or more and + is one or more. No difference from standard usage.

It should be correct now.

Regards
Helmut

tried to instal it, I have no idea what to even import to use the library. I’ve tried a bunch of variations of Fmlib.Parse, Fmlib_parse, Fmlib.Fmlib_parse, etc.

If you use dune, you can use the parsing functions with the following line in your dune file

(libraries
    fmlib.fmlib_parse
)

In the source code you just use the module Fmlib_parse. Either open it or use the fully qualified path like Fmlib_parse.Character.Make (...).

The naming convention in opam seems to be <package_name>.<module_name> all in lowercase in the dune file.

In many opam provided libraries the package name and the module name are identical. If this is the case (libraries package-name) seems to be sufficient. However fmlib has three toplevel modules fmlib_std, fmlib_pretty and fmlib_parse.

Maybe I should have made three packages where the package name and the name of the toplevel module are the same.

2 Likes