OCaml Parser combinator library as powerful as fastparse(Scala)?

I’ve recently discovered the fastparse Scala library GitHub - com-lihaoyi/fastparse: Writing Fast Parsers Fast in Scala and I wonder if we have a similar library in OCaml? I had a look at mparser and angstrom, but none of them seem to allow to easily define mutually recursive parsers, or to easily handle spacing like in fastparse.

Look for example the parser for Scala: fastparse/Exprs.scala at master · com-lihaoyi/fastparse · GitHub
where you can implicitely define the whitespace function the sequencing of parser should ignore, or how all those parsers can be mutually recursive.

I haven’t used those libraries you mention, but a quick glance suggests that they can handle recursion eg for angstrom, via an explicit fix operator.

The problem with trying to use recursive functions directly is usually that you need to have an explicit argument for OCaml to accept a recursive definition. This tends to make the parsers more ugly than they would be eg in Haskell or Scala. Perhaps others can comment further?

One possibility might be to use an object to provide the mutually recursive bindings without the explicit arguments required for plain OCaml recursive functions.

I think angstrom and mparser can easily accommodate mutually recursive parser combinators. We have lots of parser combinator libraries for OCaml, and I’m pretty sure they’re all capable of it.

The parser combinator library in my Orsetto project is another example. Consider its implementation of Unicode regular expressions, ucs_regx.ml. What’s going on there is a parser combinator is used to recognize Unicode Regular Expressions and compile them into deterministic finite automaton, which may then be used to compose lexical analyzers. The Unicode Regular Expression language is quite a bit more involved than the ones we normally see used in the ASCII character set, and there are quite a lot of mutually recursive combinators in there.

2 Likes

Mutual recursion should not be a problem for combinator parsers. Can you give me a simple example of mutual recursion which you think it cannot be parsed by a combinators. If the example is not too complicated, I can try to implemented it via parsing combinators.

Regards
Helmut

As an example you can take a mini-C where you have stmts and expressions and function definitions, but an expression can also be a lambda.

like parsing: if (x == 1) { return (lambda () { return x +1; }); } else { return 0; }
With an AST like:
type expr = Literal of int | Plus of expr * expr | Lambda (string list * stmt)
and stmt = Return of expr | ExprStmt of expr | If of (expr * stmt * stmt) | Assign of string * expr

so just 2 mutually recursive parser combinators.

@aryx

As promised the working definition of a combinator parser based on the library
fmlib.

First the abstract syntax (which I have slightly modified from your post):

(* Abstract syntax *)
type expression =
    | Literal of int
    | Variable of string
    | Plus of expression * expression
    | Lambda of string list * statement list

and statement =
    | Return of expression
    | If of expression * statement list * statement list
    | Assign of string * expression

and definition =
    string * string list * statement list


(* We want to parse a definition, therefore the final type is a definition *)
module Final =
struct
    type t = definition
end

Now the parser module

module Parse =
struct
    include Fmlib_parse.Character.Make (Unit) (Final) (String)

    module String_set = Set.Make (String)

    let keywords: String_set.t =
        String_set.(empty |> add "if" |> add "return" |> add "lambda")


    let whitespace: int t =
        skip_zero_or_more (char ' ' </> char '\n' </> char '\r')


    let keyword (kw: string): string t =
        let* str = backtrack (string kw) kw in
        let* _   = whitespace in
        return str


    (* A raw identifier is a string starting with letter, followed by zero or
     * more letters, digits or underscores. *)
    let raw_name: string t =
        let* first = letter in
        zero_or_more
            (String.make 1 first)
            (fun c str -> str ^ String.make 1 c)
            (letter </> range '0' '9'  </> char '_')


    (* An identifier is a raw identifier which is not a keyword. Whitespace is
     * stripped. *)
    let identifier: string t =
        let* id =
            backtrack
                (
                    let* name = raw_name in
                    if String_set.mem name keywords then
                        unexpected "identifier"
                    else
                        return name
                )
                "identifier"
        in
        let* _ = whitespace in
        return id


    (* Digit sequence with whitespace stripped. *)
    let number: expression t =
        let* n =
            one_or_more
                Fun.id
                (fun n d -> 10 * n + d)
                digit
        in
        let* _ = whitespace in
        return (Literal n)


    (* [p1] and [p2] are the opening and closing parentheses. *)
    let parenthesized (p1: char) (p2: char) (p: unit -> 'a t): 'a t =
        let* _ = char p1 in
        let* _ = whitespace in
        let* a = p () in
        let* _ = whitespace in
        let* _ = char p2 in
        let* _ = whitespace in
        return a


    (* '(' name1, name2, ... ')' *)
    let formal_arguments: string list t =
        let separator =
            let* _ = char ',' in
            whitespace
        in
        parenthesized
            '('
            ')'
            (fun () ->
                 let* lst =
                     one_or_more_separated
                         (fun first -> [first])
                         (fun lst _ nxt -> nxt :: lst)
                         identifier
                         separator
                     </>
                     return []
                 in
                 return (List.rev lst))


    (* Now the mutual recursive definition of [expression], atomic expression,
     * statement and statement list. *)
    let rec expression (): expression t =
        let plus =
            let* _ = char '+' in
            whitespace
        in
        one_or_more_separated
            Fun.id
            (fun e1 _ e2 -> Plus (e1, e2))
            (atomic ())
            plus


    and atomic (): expression t =
        number
        </>
        (
            let* id = identifier in
            return (Variable id)
        )
        </>
        (
            let* _ = keyword "lambda" in
            let* fargs = formal_arguments in
            let* stmts = statements () in
            return (Lambda (fargs, stmts))
        )
        </>
        parenthesized '(' ')' expression

    and statement (): statement t =
        (
            let* _ = keyword "return" in
            let* e = expression () in
            return (Return e)
        )
        </>
        (
            let* _ = keyword "if" in
            let* bexp = parenthesized '(' ')' expression in
            let* stmts1 = statements () in
            let* _ = keyword "else" in
            let* stmts2 = statements ()
            in
            return (If (bexp, stmts1, stmts2))
        )
        </>
        (
            let* id = identifier in
            let* _  = char '=' in
            let* _  = whitespace in
            let* expr = expression () in
            return (Assign (id, expr))
        )


    and statements (): statement list t =
        parenthesized
            '{'
            '}'
            (fun () ->
                 let* statements =
                     one_or_more_separated
                         (fun stmt -> [stmt])
                         (fun stmts _ nxt -> nxt :: stmts)
                         (statement ())
                         (let* _ = char ';' in whitespace)
                 in
                 return (List.rev statements))


    (* name (arg1, arg2, ... ) { body } *)
    let definition: definition t =
        let* id = identifier in
        let* fargs = formal_arguments in
        let* stmts = statements () in
        return (id, fargs, stmts)
end

Finally an inline test which proves that the parser works as expected.

let%test _ =
    let open Parse in
    let p =
        Parser.run_on_string
            "f1 (a,b,c) { return (lambda (i,j) { a = 10; return i }) }"
            (make () definition)
    in
    Printf.printf "pos %d, %d\n" (Parser.line p) (Parser.column p);
    Parser.has_succeeded p
2 Likes

Nice! Compared to fastparse though, it’s a bit annoying to have to explicitely call backtrack() and handle whitespace. Fastparse by default allows to backtrack, and by doing certain import it also handle automatically skipping whitespaces (or comments). This make the parser combinators closer to the original grammar of the language you want to parse.

Still, nice that one parser combinator library (fmlib) allows recursion and mutually recursive combinators without requiring ‘fix’ or other hacks when ‘let rec’ and ‘and’ can do fine.

@aryx

A parser which backtracks by default is not very efficient. Therefore the explicit calls to backtrack only when backtracking is necessary. The explicit handling of whitespace makes it possible to make indentation sensitive parsers where whitespace (at least at the start of a line) is significant.

But note that the library is just a toolkit. You can use a generic parser (and not the character parser) to build any parser you want, even a parser which strips whitespace implicitly and backtracks implicitly.

P.S.: Since I am the author of fmlib I have used fmlib to demonstrate the power of combinator parsers. However I would be very surprised if the other combinator parsers available in ocaml wouldn’t have the same power with respect to mutual recursion.

1 Like

The JSON parser in Orsetto (see Json_scan.ml) is about 500 lines, about half of which is the lexical analyzer, and about 100 lines is some logic that I’m trying to make obsolescent in a future release.

Parsing JSON objects is an illustrative example of where a backtracking feature of some kind is necessary. First, all the object members need to scanned into a map of member names associated with the stream position of the value, then you need to backtrack to parse according to the type associated with the member name. In the Orsetto framework (with the forthcoming 1.1 release), this is generalized by the Cf_structure_scan module.

Allows you to write parsers that looks like this:

type abc = Record of { a: string; b: int option; c: float }

module S = Json_scan
open S.Structure.Affix

let parser : abc S.t =
  "alfa" %: S.text @>
  "bravo" %? S.integer @>
  "charlie" %= (S.float, 1.0) @>
  S.Structure.return <@ fun a b c -> Record { a; b; c }

The CBOR parser contains a similar mechanism for parsing maps.

@jhw
Why do you think that backtracking is necessary for parsing json? I am sure that I can write a json parser with a combinator library which does not use backtracking.

In the example I have given, backtracking was necessary because in a character parser (i.e. a lexerless parser) it is not possible to distinguish keywords from identifiers because they have the same grammatical structure. In json, I don’t see a similar necessity.

It is possible to parse keywords and identifiers without backtracking with parsing combinators by using a lexer combinator which delivers its tokens to a parser which parses the high level structures.

I should clarify. It’s necessary for a validating parser. If all you want to do is recognize any well-formed syntax tree, then you don’t need any backtracking.

How interesting. So when you see

{ "a": <json1>, "b": <json2>, "c": <json3> }

you remember the positions of the start of the sub-trees, and then go back and validate-and-parse ? At first glance, to know where the sub-trees end, would seem to require at least parenthesis-matching of the entire subtree ? So either you keep that data-structure around (which amounts to a tree of a sort), or you have to re-scan subtrees (e.g. in the case where <json1> is itself an object ?

Just curious how you do it. Obviously, for some JSON Schema (we could call them “conjunctive” or non-backtracking) this isn’t necessary. And those schema aren’t so negligible: they include enough to be able to support constructor-datatypes.

@jhw
What do you mean by validating parser?

A json parser should accept valid json and reject invalid json. Otherwise it would not be a json parser.

And why do you think that validating requires backtracking?

Well I don’t actually have a JSON Schema compiler yet, and I’m not sure I want to write one, but how Orsetto will handle this sort of thing is by parsing enough of the object between the delimeters to validate the syntax is well-formed, but it won’t keep the whole abstract tree around for each field. It will just memorize the stream position for the start of each field, check that all the fields conform to the object schema, then backtrack and apply the specialized parser corresponding to each field type. Ultimately, a constructor function with arguments for each of the parsed member values will be called to construct the output fo the parser as an application data type.

I mean a parser that rejects well-formed syntax if it’s invalid according to the data model.

Right, and to do this for the start of the second field, requires full parenthesis-matching of the text of the first field, yes? It sounds like this approach re-scans a deeply-nested subtree approximately O(depth-of-nesting). Or something like that. That’s my question.

I think these are usually separate concerns. The parser constructs an AST, not caring about the data model, then a decoder transforms the well-formed AST into the required data (of course this can fail too).

Mmmm … so, in the case of languages used for wireline data-serialization, parsing directly into application data-structures is actually pretty important. Parsing first into JSON (or for that matter, XML DOM) and then into application datastructures is much more expensive (that intermediate tree gets build, traversed, and discarded).

One of the singular faults of JSON is that it’s difficult to do this: at least with XML Schema, it was feasible. Specifically, the fact that JSON keys are not ordered is very problematic for efficient demarshalling.

P.S. by “ordered” I don’t mean “sorted”, but rather, that you cannot specify for an object, what the order in which keys will be seen on the wireline. Obviously if this were something you could specify, it would make JSON much, much, much less human-friendly. But that’s also what makes it a much less feasible wire-format for machine-to-machine communication.

P.P.S. of course, protobufs and thrift also share the property that fields do not need to be presented in any particular order. But both of those have the property that the type of any one field is not dependent on the value of other fields (setting aside “oneOf” which is a special case). Which property, JSON does not enjoy (at least, not with JSON Schema as the IDL).

1 Like

Yeah, you’re right of course for specific formats which allow defining schemas, like Thrift. I guess I was thinking more about JSON.

Yes, I think that that’s what @jhw is getting at (and that’s what JSON Schema ought to be for, if they’d only designed it right). If you look at the way JSON gets used in cloud computing (for instance), it’s clear that that data needs to be strongly-typed, and that it’s meant to be consumed by programs which themselves ought to be demarshalling into precise application data-structures.