Yeah, but the first pass is significantly less expensive than the second. You only have to allocate nodes in a list recording the starting position of each field. Beyond that, the first pass is linear in space for the parenthesis stack, and linear in time in the size of the input.
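For concreteness, here is a rough sketch of that kind of first pass (hypothetical code, not from any of the libraries discussed; it ignores string literals and escapes, and uses a depth counter in place of the parenthesis stack): a single linear scan that records only the offset at which each top-level member starts.

```ocaml
(* First pass: record the byte offset where each top-level object member
   starts, without building a tree.  Sketch only: string contents and
   escape sequences are not handled. *)
let member_offsets (s : string) : int list =
  let offs = ref [] in
  let depth = ref 0 in
  let expecting = ref false in
  String.iteri
    (fun i c ->
       match c with
       | '{' | '[' ->
           incr depth;
           if !depth = 1 then expecting := true
       | '}' | ']' -> decr depth
       | ',' when !depth = 1 -> expecting := true
       | ' ' | '\t' | '\n' | '\r' -> ()
       | _ ->
           (* First non-blank character of a member at depth 1. *)
           if !depth = 1 && !expecting then begin
             offs := i :: !offs;
             expecting := false
           end)
    s;
  List.rev !offs

let () = assert (member_offsets {|{"a": 1, "b": [2, 3]}|} = [1; 9])
```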
To use this effectively, Orsetto first converts the source text to a Uchar.t Seq.t with a UTF transport, then scans that into a persistent sequence (it can be an array) of Json_event.t values with a Unicode lexical analyzer, and finally uses a second combinator parser that takes Json_event.t as its input symbol. Passing the nst/JSONTestSuite is hard, but this approach seemed to let me pass tests that other OCaml JSON parsers fail.
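For illustration, here is a minimal sketch of that staged shape with hypothetical names (this is not Orsetto's actual interface); the only real API used is the Stdlib UTF-8 decoder available since OCaml 4.14.

```ocaml
(* Hypothetical event type and pipeline shape; not Orsetto's actual API. *)
type json_event =
  | Obj_start | Obj_end
  | Arr_start | Arr_end
  | Member of string                           (* object member name *)
  | Str of string | Num of float | Bool of bool | Null

(* Stage 1: decode UTF-8 bytes into Unicode scalar values (Stdlib, OCaml >= 4.14). *)
let uchars_of_string (s : string) : Uchar.t Seq.t =
  let rec go i () =
    if i >= String.length s then Seq.Nil
    else
      let d = String.get_utf_8_uchar s i in
      Seq.Cons (Uchar.utf_decode_uchar d, go (i + Uchar.utf_decode_length d))
  in
  go 0

(* Stages 2 and 3, given only as signatures here: a lexical analyzer that
   turns code points into events, and a combinator parser whose input
   symbol is [json_event] rather than a character. *)
module type EVENT_PIPELINE = sig
  type 'a parser
  val events : Uchar.t Seq.t -> json_event Seq.t
  val run : 'a parser -> json_event Seq.t -> ('a, string) result
end
```

The point of the split is that the second parser never sees raw code points, so string escapes and Unicode concerns stay confined to the lexical stage.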
I’ll admit the Unicode lexical analyzer is total overkill in the case of JSON because code points outside the ASCII range are only legal inside string literals. But I used it anyway because Orsetto already has a Unicode lexical analyzer for other reasons, and I put a lot of work into making the DFA state tables fast.
The parser does have a function that will accept any well-formed JSON syntax and return the whole tree. It would be the usual pain in the neck to decode that into an application data structure. I include it mainly because it’s useful to be able to pass around raw syntax trees as opaque representations of an external type that you don’t decode.
In my day job, we use something very much like G* protocol buffers — mostly a workalike with additional features that G* doesn’t think are important. There is something to be said for an all-singing, all-dancing IDL that compiles down to scanners and formatters for your application data structures in your programming language. I cannot deny that.
I’m very skeptical that IDL specifications make for good protocol specifications for interoperability at Internet scale, and I think JSON Schema is a terrible substitute for a proper IDL, because after you’ve validated that JSON text according to its schema, you’re still left with an abstract syntax tree that you need to navigate with something like JSON Path (or handcrafted code that traverses the tree).
Anyway, all of this is to explain why I’m not enthusiastic about JSON and I much prefer CBOR where I can use it instead. The CDDL language for specifying CBOR and JSON grammars is, in my view, better than JSON Schema, and it’s notable that the CDDL specification goes out of its way to tell you that CDDL is not an IDL and doesn’t want to be one.
I’ve no idea who is correct, but since I was just reading this…
Angstrom is a parser-combinator library that makes it easy to write efficient, expressive, and reusable parsers suitable for high-performance applications.
…
Parsers are backtracking by default and support unbounded lookahead.
So there’s at least one parser lib which is backtracking by default and also makes claims of efficiency and performance.
Certainly backtracking has a cost. During parsing you have to store the tokens already consumed. When backtracking occurs, you have to unconsume them.
If you store all the consumed tokens in a randomly accessible data structure, then the unconsume operation has no additional cost: you just reset the input pointer. However, you do have to store all incoming tokens. When you parse a string, even this costs nothing, because the string is the data structure. If you parse a file, then the storage has a cost.
The next performance-relevant factor is the parsing work that is wasted. All the partial syntax trees already constructed are thrown away. If the construction has been done on mutable structures, then some form of undo is necessary.
The parsing library fmlib_parse is not backtracking by default. If you want a backtrackable combinator you have to ask for it: just wrap your combinator p in backtrack p, and the resulting combinator either succeeds or fails without consuming tokens, i.e. after a backtrackable combinator has failed, another alternative can be tried. If you don’t ask for a backtrackable combinator, then you don’t pay the price. This is a deliberate design choice. The Angstrom library has obviously made a different design choice.
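As an illustration of that trade-off, here is a small self-contained sketch (not fmlib_parse's actual implementation, just the general pattern): the alternative combinator only tries its second branch when the first branch failed without consuming input, and backtrack makes a consuming failure look non-consuming by restoring the position.

```ocaml
(* Sketch of a non-backtracking-by-default combinator design.
   Ok (value, new_pos) on success; Error pos_reached on failure. *)
type 'a parser = string -> int -> ('a * int, int) result

let char (c : char) : char parser = fun s pos ->
  if pos < String.length s && s.[pos] = c then Ok (c, pos + 1) else Error pos

let return (x : 'a) : 'a parser = fun _ pos -> Ok (x, pos)

let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser = fun s pos ->
  match p s pos with
  | Ok (x, pos') -> f x s pos'
  | Error e -> Error e

(* Alternative: [q] is tried only if [p] failed without consuming input. *)
let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser = fun s pos ->
  match p s pos with
  | Ok _ as ok -> ok
  | Error e when e = pos -> q s pos       (* nothing consumed: try q *)
  | Error e -> Error e                    (* consuming failure: committed *)

(* [backtrack p] converts a consuming failure into a non-consuming one. *)
let backtrack (p : 'a parser) : 'a parser = fun s pos ->
  match p s pos with
  | Ok _ as ok -> ok
  | Error _ -> Error pos

(* "ab" and "ac" share the prefix 'a', so the first branch must be made
   backtrackable; otherwise its failure after consuming 'a' commits the
   whole alternative. *)
let ab = char 'a' >>= fun _ -> char 'b' >>= fun _ -> return "ab"
let ac = char 'a' >>= fun _ -> char 'c' >>= fun _ -> return "ac"
let () = assert ((backtrack ab <|> ac) "ac" 0 = Ok ("ac", 2))
```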
You can have a look at pacomb (available on opam & github). It uses parser combinators that run in parallel in some sense (using CPS style and a “scheduler”) and therefore do not backtrack in the usual sense. Pacomb is not really intended to be used directly, though, but rather as a target to compile a BNF to.
More generally, with OCaml 5.0 and effects, you could probably produce very efficient combinators that parse your input in parallel.
Here is the comment explaining our combinators, from the file comb.ml:
Combinators are a standard approach to parsing in functional languages. The major advantage of combinators is that they allow manipulating grammars as first-class values. However, they generally suffer from two major defects. We start this file with a global description of the original features of this library.
- Incomplete semantics. A grammar "(a|b)c" may fail to backtrack and try "bc" if parsing for "ac" fails in "c". This is traditionally solved with continuations: combinators must be given the function that will be used to parse the remaining input. In general, a parsing combinator returning a value of type ['a] with continuations will have type [env -> (env -> 'a -> bot) -> bot], where [env] is the type maintaining the data necessary for parsing (like the current input stream) and [bot] is the type of [false]. ['a cont = 'a -> env -> bot] is therefore the continuation type.
- Exponential semantics. The parsing problem for context-free grammars can be solved in polynomial time (O(n³) implementations are often proposed). As combinators backtrack, they usually lead to exponential behaviour. This is solved here by a [cache] combinator, which avoids parsing the same part of the input twice with the same grammar.
- Backtracking is also a problem, because we need to go back in the input to try other alternatives. This means that the whole input must remain in memory. This is solved by terminals returning immediately instead of calling the continuation. A "scheduler" manages the continuations and the alternative branches for parsing. This means that instead of [bot], we use a type ['a res] with two constructors, one for continuations and one for alternative parsing. A global queue stored in all environments is maintained by the scheduler, and the next action is taken from the top of the queue by the scheduler. The ordering in the queue is a lexicographic ordering on the input position and definition dependence (if A calls B, the continuation of B must be called before the continuation of A). The former ensures that all terminals are parsed in parallel, and therefore that the beginning of the input can be collected by the GC. The latter is necessary for the [cache] combinator to work.
- In many techniques that cover ambiguous grammars, a right-recursive grammar will try to compute the action for every accepted prefix of the input, often leading to quadratic parsing time. This is solved by delaying the evaluation of the semantics, but not by too much, so that the user can still call the [give_up] function to reject some parses from the action code. More generally, to ensure O(1) cost for most steps of parsing, we use two specific types to represent the values returned by parsing and the values transforming them.
- Specific combinators (lr, lr_pos and mlr) are provided to transform left-recursive grammars, which loop with combinators, into non-left-recursive ones.
- A blank function is used (and it can be changed during parsing) to compensate for the scannerless nature of combinators and to deal with blank characters.
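To make the continuation-passing idea in the first bullet concrete, here is a heavily simplified, self-contained sketch (not pacomb's code): [bot] is approximated by a fixed answer type bool and failure by an exception, which is enough to show why a failure in the continuation lets the alternative try its other branch.

```ocaml
(* Simplified continuation-passing combinators: a combinator receives the
   parsing environment and the continuation that parses the rest of the
   input; failure unwinds back to the nearest alternative. *)
exception Fail

type env = { input : string; pos : int }

type 'a comb = env -> (env -> 'a -> bool) -> bool

let lit (w : string) : string comb = fun env k ->
  let n = String.length w in
  if env.pos + n <= String.length env.input
     && String.sub env.input env.pos n = w
  then k { env with pos = env.pos + n } w
  else raise Fail

let seq (p : 'a comb) (q : 'b comb) : ('a * 'b) comb = fun env k ->
  p env (fun env' x -> q env' (fun env'' y -> k env'' (x, y)))

(* The continuation runs inside the [try]: if the rest of the grammar fails
   after [p] succeeded, control comes back here and [q] is tried. *)
let alt (p : 'a comb) (q : 'a comb) : 'a comb = fun env k ->
  try p env k with Fail -> q env k

(* Grammar ("a" | "ab") "c" on input "abc": the branch "a" succeeds first,
   the continuation "c" then fails at 'b', the failure unwinds into [alt],
   and the branch "ab" is tried -- the behaviour a naive result-returning
   [alt] would miss. *)
let accepted (s : string) : bool =
  let g = seq (alt (lit "a") (lit "ab")) (lit "c") in
  try g { input = s; pos = 0 } (fun env _ -> env.pos = String.length env.input)
  with Fail -> false

let () = assert (accepted "abc")
```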
That’s right. Left recursion is not allowed in parsing expression grammars (PEGs). However, this is usually not a problem: left factoring is the solution.
Elimination of left recursion often introduces issues with stack size in the action code. It is easier and more direct to translate a left-recursive BNF using a specific fixpoint combinator dedicated to left recursion, as sketched below.
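For illustration, here is a small self-contained sketch of such a fixpoint (not pacomb's actual lr combinator, just the idea): parse the non-recursive head once, then fold the left-recursive tail, which keeps left associativity without looping.

```ocaml
(* A left-recursive rule  e ::= e "-" digit | digit  loops with naive
   combinators; the [lr] fixpoint below parses the head and then iterates
   the tail, combining left-to-right. *)
type 'a parser = string -> int -> ('a * int) option

let digit : int parser = fun s pos ->
  if pos < String.length s && s.[pos] >= '0' && s.[pos] <= '9'
  then Some (Char.code s.[pos] - Char.code '0', pos + 1)
  else None

let token (c : char) : unit parser = fun s pos ->
  if pos < String.length s && s.[pos] = c then Some ((), pos + 1) else None

(* [lr head step]: parse [head] once, then repeatedly parse a tail with
   [step], threading the accumulated value -- the translation of
   e ::= e tail | head. *)
let lr (head : 'a parser) (step : 'a -> 'a parser) : 'a parser = fun s pos ->
  match head s pos with
  | None -> None
  | Some (x, pos') ->
      let rec loop acc pos =
        match step acc s pos with
        | Some (acc', pos') -> loop acc' pos'
        | None -> Some (acc, pos)
      in
      loop x pos'

(* e ::= e "-" digit | digit, left associative: "9-3-2" evaluates to 4. *)
let expr : int parser =
  lr digit (fun acc s pos ->
    match token '-' s pos with
    | None -> None
    | Some ((), pos') ->
        (match digit s pos' with
         | None -> None
         | Some (d, pos'') -> Some (acc - d, pos'')))

let () = assert (expr "9-3-2" 0 = Some (4, 5))
```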