[ANN] Layout Parsing and Nicely Formatted Error Messages

In a previous post I described my path from LALR parsing to combinator
parsing. I am now more and more convinced that combinator parsing is a
good and flexible way to write parsers. The new release 0.5.0 of Fmlib focuses
on layout parsing and nicely formatted error messages, both built on combinator
parsing.

The library can be installed via opam with opam install fmlib. The source code
is hosted in a GitHub repository, and the API documentation can be found online.
See also a tutorial on combinator parsing.

Layout Parsing

Most programming languages express hierarchical structures by some kind of
parentheses. Algol-like languages use begin and end, C-like languages use curly
braces { and } to enclose blocks of code. Since blocks can be nested inside
blocks, the hierarchical or tree structure is well expressed by the syntax.

For the human reader, blocks are usually indented to make the hierarchical
structure graphically visible. Programming languages like Haskell and
Python omit the parentheses and express the hierarchical structure by
indentation, i.e. the indentation is part of the grammar. This is pleasing to
the eye, because many parentheses can be omitted.

The hierarchical structure in the following schematic source file is
immediately visible without the need for parentheses.

xxxxxxxxxxx
    xxx
    xxx
        xxxxxxx
xxxxxxxx
    xxx

Lower level blocks are indented with respect to their parent block and siblings
at the same level are vertically aligned.

Because of this good readability, configuration languages like YAML have
become very popular.

Unfortunately, there are not many parsers available which support indentation
sensitivity. Fmlib supports parsing languages whose grammar uses indentation to
structure blocks hierarchically.

Only three combinators are needed to introduce layout parsing in combinator
parsing. Suppose that p is a combinator parsing a certain construct. Then we
have

  • indent 4 p: Parse the construct described by p indented at least 4
    columns relative to its environment

  • align p: Parse the construct described by p aligned vertically with its
    siblings

  • detach p: Parse the construct described by p without any indentation or
    alignment restrictions

In order to parse a list of ps vertically aligned and indented relative to its
environment by at least one column we just write

one_or_more (align p) |> indent 1

which parses a structure with the schematic layout

xxxxxxxx

    pppppppp

    pppppp

    pppp

xxxxx
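
To make the behaviour of the three layout combinators more concrete, here is a minimal self-contained toy model. It is not Fmlib's implementation and does not use Fmlib's API. It assumes the input is already split into tokens that carry their start column, and all types and helper names (token, layout, state, word, run, tok, block) are made up for this sketch. Only indent, align, detach and one_or_more mirror the names used in the text.

```ocaml
(* Toy model of layout combinators. NOT Fmlib's implementation. *)

(* A token is a word plus the column at which it starts. *)
type token = { col : int; text : string }

type layout = {
  lb : int;          (* a token must start at a column >= lb *)
  ub : int option;   (* column fixed by the first aligned token, if any *)
  aligned : bool;    (* true: the next token has to be aligned *)
}

type state = { input : token list; layout : layout }

type 'a parser = state -> ('a * state) option

let allowed (l : layout) (col : int) : bool =
  col >= l.lb
  && (not l.aligned || match l.ub with None -> true | Some c -> col = c)

(* The first aligned token fixes the column for its later siblings. *)
let after_token (l : layout) (col : int) : layout =
  if l.aligned then { l with ub = Some col; aligned = false } else l

(* Accept one token with the given text if the layout permits its column. *)
let word (expected : string) : string parser = fun st ->
  match st.input with
  | tok :: rest when tok.text = expected && allowed st.layout tok.col ->
      Some (tok.text, { input = rest; layout = after_token st.layout tok.col })
  | _ -> None

let return (x : 'a) : 'a parser = fun st -> Some (x, st)

let ( let* ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser = fun st ->
  match p st with
  | None -> None
  | Some (x, st') -> f x st'

let rec zero_or_more (p : 'a parser) : 'a list parser = fun st ->
  match p st with
  | None -> Some ([], st)
  | Some (x, st') ->
      Option.map (fun (xs, s) -> (x :: xs, s)) (zero_or_more p st')

let one_or_more (p : 'a parser) : 'a list parser = fun st ->
  match p st with
  | None -> None
  | Some (x, st') ->
      Option.map (fun (xs, s) -> (x :: xs, s)) (zero_or_more p st')

(* Run p at least n columns to the right of the enclosing lower bound,
   then restore the enclosing layout. *)
let indent (n : int) (p : 'a parser) : 'a parser = fun st ->
  let inner = { lb = st.layout.lb + n; ub = None; aligned = false } in
  match p { st with layout = inner } with
  | None -> None
  | Some (x, st') -> Some (x, { st' with layout = st.layout })

(* Require the first token of p to be aligned with its siblings. *)
let align (p : 'a parser) : 'a parser = fun st ->
  p { st with layout = { st.layout with aligned = true } }

(* Run p without any layout restriction, then restore the layout. *)
let detach (p : 'a parser) : 'a parser = fun st ->
  match p { st with layout = { lb = 0; ub = None; aligned = false } } with
  | None -> None
  | Some (x, st') -> Some (x, { st' with layout = st.layout })

let run (p : 'a parser) (tokens : token list) : ('a * token list) option =
  let start =
    { input = tokens; layout = { lb = 0; ub = None; aligned = false } } in
  Option.map (fun (x, st) -> (x, st.input)) (p start)

let tok col text = { col; text }

(* A header "x" followed by an indented, vertically aligned list of "p". *)
let block : (string * string list) parser =
  let* head = word "x" in
  let* items = one_or_more (align (word "p")) |> indent 1 in
  return (head, items)
```

Running block on the tokens x at column 0 followed by three p at column 4 collects all three items. A p at a different column ends the list, and a p at column 0 is rejected because it is not indented.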

User Friendly Error Messages

It is important for a parser writer to make syntax error messages user
friendly. Fmlib has some support for writing friendly error messages. The
operator <?>, copied from the Haskell library parsec, equips a combinator with
a descriptive error message for the case that it fails to parse its construct.
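
The idea behind a parsec-style <?> can be sketched with a tiny self-contained parser type. This is an illustration only and does not use Fmlib's types or signatures: the parser type, char_ and <|> are assumptions made up for this sketch. A failing parser reports the position and a list of expectations, and <?> replaces that list with a single descriptive label when the parser failed without consuming input.

```ocaml
(* Toy model of a parsec-style label operator. NOT Fmlib's API. *)

(* A parser reads a string from position i. On failure it reports the
   failure position and the list of things it expected there. *)
type 'a parser = string -> int -> ('a * int, int * string list) result

let char_ (c : char) : char parser = fun s i ->
  if i < String.length s && s.[i] = c then Ok (c, i + 1)
  else Error (i, [ Printf.sprintf "'%c'" c ])

(* Try p; if it fails without consuming input, try q and merge the
   expectations of both alternatives. *)
let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser = fun s i ->
  match p s i with
  | Error (pos, e1) when pos = i ->
      (match q s i with
       | Error (pos2, e2) when pos2 = i -> Error (i, e1 @ e2)
       | other -> other)
  | other -> other

(* Replace low-level expectations by one descriptive label. *)
let ( <?> ) (p : 'a parser) (label : string) : 'a parser = fun s i ->
  match p s i with
  | Error (pos, _) when pos = i -> Error (i, [ label ])
  | other -> other

let sign : char parser = char_ '-' <|> char_ '+'

let labelled_sign : char parser = sign <?> "sign"
```

On the input "x", sign reports the two raw expectations '-' and '+', while labelled_sign reports the single expectation "sign". On success the label has no effect.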

At the end of a failed parse, the syntax (or semantic) errors have to be
presented to the user. Suppose there is a combinator parser for a YAML-like
structure. By default the library writes error messages in the form

1 |
2 | names:
3 |      - Alice
3 |      - Bob
4 |
5 |   category: encryption
      ^

I have encountered something unexpected. I was
expecting one of

    - at 3 columns after

        - sequence element: "- <yaml value>"

    - at 2 columns before

        - key value pair: "<key>: <yaml value>"

    - end of input

The raw information (line and column numbers, individual expectations, failed
indentation or alignment expectations) is available as well, so that you can
present the error messages to the user in any other form.
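
As an illustration of presenting such raw information in a different form, the following sketch renders an error as a single compiler-style line. The record and variant types here are hypothetical shapes invented for this example and are not the types Fmlib exposes. The file name and the position values in the sample are illustrative as well.

```ocaml
(* Hypothetical shapes for raw error data. NOT Fmlib's types. *)
type layout_hint =
  | Columns_after of int    (* the construct is expected n columns further right *)
  | Columns_before of int   (* the construct is expected n columns further left *)

type expectation = { what : string; hint : layout_hint option }

type syntax_error = { line : int; column : int; expected : expectation list }

let describe (e : expectation) : string =
  match e.hint with
  | None -> e.what
  | Some (Columns_after n) -> Printf.sprintf "%s (%d columns after)" e.what n
  | Some (Columns_before n) -> Printf.sprintf "%s (%d columns before)" e.what n

(* One-line message in the "file:line:column" style of many compilers. *)
let one_line (file : string) (err : syntax_error) : string =
  Printf.sprintf "%s:%d:%d: expected %s" file err.line err.column
    (String.concat " or " (List.map describe err.expected))

(* Sample modelled on the expectations of the message shown above. *)
let sample : syntax_error =
  { line = 5;
    column = 3;
    expected =
      [ { what = "sequence element"; hint = Some (Columns_after 3) };
        { what = "key value pair"; hint = Some (Columns_before 2) };
        { what = "end of input"; hint = None } ] }
```

Calling one_line "config.yml" sample yields one line that lists the three expectations separated by "or", which suits editors and tools that expect one diagnostic per line.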

There is also a component Fmlib_pretty in the library for pretty printing any ASCII text.
