Unicode-aware parser combinators

For my application (a dependently typed proof assistant) I need a parser with (among other things) two properties:

  • It is aware of UTF-8. Yes, any OCaml string can contain UTF-8 data, but it seems to me (correct me if I’m wrong) that parsing operations like “consume characters until you reach a close-parenthesis” are much easier to express correctly if “character” automatically refers to a unicode code point rather than a byte, since otherwise you might accidentally run into a byte that looks like a “close parenthesis” (or whatever) in the middle of a multibyte code point.
  • It can be modified at run-time, so that the user of the proof assistant can declare new mixfix notations with specified associativity and precedence. I believe (again, correct me if I’m wrong) that this will be much easier with a parser combinator library than with a parser generator like menhir, since the latter generates code that then has to be compiled with the application.

I haven’t been able to find a parser library for OCaml that satisfies both of these properties: the parser combinator libraries like angstrom, mparser, and opal that I’ve found are not, as far as I can tell, UTF-8 aware, being based on OCaml’s 8-bit char type.

Can someone recommend a parser library with both of these properties? Or suggest a different way to achieve what I want?

1 Like

Wait, really? I thought all Unicode code points were completely distinct from ASCII characters i.e. the ones with higher codes than the ASCII range?

You’re right, sorry, that’s a bad example: UTF-8 is self-synchronizing, the subsequent bytes of a multibyte code word are not ASCII bytes. But I have seen warnings on parser libraries saying that they do not support unicode, so I guess I lept incorrectly to the guess that this was the problem. What then is the problem? Or is there no problem?

I guess a straightforward example is that it would be useful to have a built-in command to “parse a single character” and have “character” refer to a code point. I’ve certainly run into problems using a non-utf8-aware regexp library, e.g. if you try to put a multibyte character inside a regexp character class, it gets interpreted as a choice between its constituent bytes. Maybe this won’t arise if I use a parser library “correctly”, but it would be nice to be confident that the library author had dealt with all the utf-8 questions so I can think at the level of abstraction of code points.

Are you at all willing to settle on a fixed set of lexemes ? If so, maybe you could use sedlex for lexing (it supports unicode) and then some parser-combinator library that works with sedlex ? Just a thought.

By “a fixed set of lexemes” I mean something like what OCaml itself does: declare what the operators are beforehand, but allow them to be defined much later.

I am willing to fix in advance what counts as a token. (For instance, maybe parentheses and curly braces are always their own tokens, and otherwise tokens are separated only by whitespace; this is kind of like what Agda does.) I’m not willing to fix in advance which tokens are names and which are the components of mixfix operators, or what their associativity or precedence is – OCaml does do that too, but I guess that’s probably not what you meant.

So if you can recommend a parser-combinator library that works on an undifferentiated stream of tokens that would be output by something like sedlex, I would be happy to try that. All the parser-combinator libraries I’ve found so far seem designed to work on a stream of chars.

Ah. I never use parser-combinator libraries, so I can’t help there. I’m going to assume that you also want GLR, and not LL(1) ? The reason I mention this is that Coq uses a fork of the Camlp5 extensible grammars (and, I would assume, Camlp5’s lexer, though I don’t actually know). [ETA] Those extensible grammars are LL(1). [/ETA] The Camlp5 extensible grammar code can definitely be used in combination with ocamllex, so I would expect that it can also be used with sedlex (though I’ve never tried). Here’s an example of it being used with ocamllex: https://github.com/camlp5/camlp5/tree/master/tutorials/calc%2Bocamllex .

So what I’m suggesting is:

  1. if you can deal with grammars like what Coq uses for extensibility
  2. then perhaps you can adapt Coq’s grammar-engine to support sedlex
  3. or heck, maybe Coq actually supports unicode? Might be worth asking …
1 Like

I haven’t learned yet what all the different names for parser types mean. (Another nice thing about a parser-combinator library is that I don’t have to.) But if I don’t find anything better, I’ll look into that; thanks!

I know of Coq libraries that do successfully use unicode, including even the standard library, so presumably it is supported.

The Ucs_scan module in my Orsetto project has those properties. Not sure I want to recommend it to you, because having used it in anger for a similar project, I have now commenced a project to refactor the whole thing from top to bottom, and I’m a long way away from releasing that still.

That said, the Orsetto test suite has reasonably good coverage, and the library works— if you can figure out how to use it from the complete lack of tutorial material. There’s even a marginally functional Unicode regular expression implementation in there.

I remember that Angstrom handled basic Unicode just fine - I used it in a production environment. Not sure about more complex cases. Sadly, seems the project looks abandoned and not updated for a while.

I would never use a parser combinator library or a parser generator for a compiler.
The real problems for a parser (and lexer) is resilient error handling and “interactive” usage like as a LSP for editing. Everything else is relatively easy, if you have a working grammar.
For handling user defined precedence and associativity (and to handle associativity and precedence in general) I always use a Pratt parser.

Pratt’s paper: https://tdop.github.io/ PDF: https://web.archive.org/web/20151223215421/http://hall.org.ua/halls/wizzard/pdf/Vaughan.Pratt.TDOP.pdf
Implementation using Elm (can do exactly the same in OCaml): Demystifying Pratt Parsers | Martin Janiczek

@viritrilbia: Have you looked into Fmlib_parse?

Fmlib_parse is a combinator library, therefore it satisfies your second requirement as any combinator library should do.

The character parser in the library works on Ascii characters i.e. is not (yet) unicode aware. I am planning to make it unicode aware in the future.

But there is another option. The library has the possibility to combine a combinator parser with a lexer (either handwritten or generated as some other combinator parser). As far as I know Fmlib_parse is the only combinator library which offers this feature.

Using this feature you can e.g. write a lexer which generates unicode tokens (i.e. unicode code points). Such a lexer should be relatively simple (even ocaml’s standard library has functions which support the writing of a lexer). Then you write a combinator parser which works on unicode tokens.

The documentation of such a two stage combinator parser is not yet very good. But I can offer help if you want to use it.


@hbr That looks like it might be exactly what I want, thanks! I will check it out and maybe get back to you.

With all the discussions about Unicode, what are the obvious pitfalls I’m missing? Long long time ago, we used Menhir in a standard way where our language used Unicode characters for quantifiers and logical connectives. We tried, and it worked; we didn’t do anything Unicode-aware-specific IIRC.

What I wanted to convey is:

1 Like

The main issue is how to describe tokens. Say you want to allow identifiers to contain arbitrary unicode (accented) letters, then writing the proper regular expression is possible but quite painful. You would prefer the lexer to provide such a character category in a builtin way.

1 Like

The problem will be in the lexer, so what you are asking is, why Sedlex instead of OCamllex.

Another problem could be needing Unicode normalisation when comparing identifiers, as they could be written in different ways (like š and ˇ+s).

1 Like

That’s not a problem for parsers. It’s just a matter of normalizing your identifier token data once they have been parsed/lexed.

This was meant as an example of problems using Unicode, not for the parser or lexer (or type checker), but the person writing or using it. Although it would be nice if the lexer would run normalisation on all generated tokens that contain a string.

Not really, it’s better to seperate concerns. There’s quite a few cases (e.g. parsing XML tags) where you do not want to perform that.