Yeah, but the first pass is significantly less expensive than the second. You only have to allocate nodes in a list recording the starting position of each field. Beyond that, the first pass is linear in space for the parenthesis stack, and linear in time in the size of the input.
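For concreteness, here is a rough sketch of that kind of first pass (hypothetical code, not from any of the libraries discussed; it ignores string literals and escapes, and uses a depth counter in place of the parenthesis stack): a single linear scan that records only the offset at which each top-level member starts.

```ocaml
(* First pass: record the byte offset where each top-level object member
   starts, without building a tree.  Sketch only: string contents and
   escape sequences are not handled. *)
let member_offsets (s : string) : int list =
  let offs = ref [] in
  let depth = ref 0 in
  let expecting = ref false in
  String.iteri
    (fun i c ->
       match c with
       | '{' | '[' ->
           incr depth;
           if !depth = 1 then expecting := true
       | '}' | ']' -> decr depth
       | ',' when !depth = 1 -> expecting := true
       | ' ' | '\t' | '\n' | '\r' -> ()
       | _ ->
           (* First non-blank character of a member at depth 1. *)
           if !depth = 1 && !expecting then begin
             offs := i :: !offs;
             expecting := false
           end)
    s;
  List.rev !offs

let () = assert (member_offsets {|{"a": 1, "b": [2, 3]}|} = [1; 9])
```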
To use this effectively, Orsetto first converts the source text to a Uchar.t Seq.t with a UTF transport, then scans that into a persistent sequence (it can be an array) of Json_event.t values with a Unicode lexical analyzer, and finally uses a second combinator parser that takes Json_event.t as its input symbol. Passing the nst/JSONTestSuite is hard, but this approach seemed to let me pass tests that other OCaml JSON parsers fail.
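For illustration, here is a minimal sketch of that staged shape with hypothetical names (this is not Orsetto's actual interface); the only real API used is the Stdlib UTF-8 decoder available since OCaml 4.14.

```ocaml
(* Hypothetical event type and pipeline shape; not Orsetto's actual API. *)
type json_event =
  | Obj_start | Obj_end
  | Arr_start | Arr_end
  | Member of string                           (* object member name *)
  | Str of string | Num of float | Bool of bool | Null

(* Stage 1: decode UTF-8 bytes into Unicode scalar values (Stdlib, OCaml >= 4.14). *)
let uchars_of_string (s : string) : Uchar.t Seq.t =
  let rec go i () =
    if i >= String.length s then Seq.Nil
    else
      let d = String.get_utf_8_uchar s i in
      Seq.Cons (Uchar.utf_decode_uchar d, go (i + Uchar.utf_decode_length d))
  in
  go 0

(* Stages 2 and 3, given only as signatures here: a lexical analyzer that
   turns code points into events, and a combinator parser whose input
   symbol is [json_event] rather than a character. *)
module type EVENT_PIPELINE = sig
  type 'a parser
  val events : Uchar.t Seq.t -> json_event Seq.t
  val run : 'a parser -> json_event Seq.t -> ('a, string) result
end
```

The point of the split is that the second parser never sees raw code points, so string escapes and Unicode concerns stay confined to the lexical stage.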
I’ll admit the Unicode lexical analyzer is total overkill in the case of JSON because code points outside the ASCII range are only legal inside string literals. But I used it anyway because Orsetto already has a Unicode lexical analyzer for other reasons, and I put a lot of work into making the DFA state tables fast.
The parser does have a function that will accept any well-formed JSON syntax and return the whole tree. It would be the usual pain in the neck to decode that into an application data structure. I include it mainly because it’s useful to be able to pass around raw syntax trees as opaque representations of an external type that you don’t decode.
In my day job, we use something very much like G* protocol buffers — mostly a workalike with additional features that G* doesn’t think are important. There is something to be said for an all-singing, all-dancing IDL that compiles down to scanners and formatters for your application data structures in your programming language. I cannot deny that.
I’m very skeptical that IDL specifications make for good protocol specifications for interoperability at Internet scale, and I think JSON Schema is a terrible substitute for a proper IDL, because after you’ve validated that JSON text according to its schema, you’re still left with an abstract syntax tree that you need to navigate with something like JSON Path (or handcrafted code that traverses the tree).
Anyway, all of this is to explain why I’m not enthusiastic about JSON and I much prefer CBOR where I can use it instead. The CDDL language for specifying CBOR and JSON grammars is, in my view, better than JSON Schema, and it’s notable that the CDDL specification goes out of its way to tell you that CDDL is not an IDL and doesn’t want to be one.
I’ve no idea who is correct, but since I was just reading this…
Angstrom is a parser-combinator library that makes it easy to write efficient, expressive, and reusable parsers suitable for high-performance applications.
…
Parsers are backtracking by default and support unbounded lookahead.
So there’s at least one parser lib which is backtracking by default and also makes claims of efficiency and performance.
Certainly backtracking has a cost. During parsing you have to store the tokens already consumed. When backtracking occurs, you have to unconsume them.
If you store all the consumed tokens in a randomly accessible data structure, then the unconsume operation has no additional cost: you just reset the input pointer. However, you do have to store all incoming tokens. When you parse a string, even this costs nothing, because the string is the data structure. If you parse a file, then the storage has a cost.
The next performance-relevant factor is the parsing work that is wasted. All the partial syntax trees already constructed are thrown away. If the construction has been done on mutable structures, then some form of undo is necessary.
The parsing library fmlib_parse is not backtracking by default. If you want a backtrackable combinator you have to ask for it: just wrap your combinator p in backtrack p, and the resulting combinator either succeeds or fails without consuming tokens, i.e. after a backtrackable combinator has failed, another alternative can be tried. If you don’t ask for a backtrackable combinator, then you don’t pay the price. This is a deliberate design choice. The Angstrom library has obviously made a different design choice.
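As an illustration of that trade-off, here is a small self-contained sketch (not fmlib_parse's actual implementation, just the general pattern): the alternative combinator only tries its second branch when the first branch failed without consuming input, and backtrack makes a consuming failure look non-consuming by restoring the position.

```ocaml
(* Sketch of a non-backtracking-by-default combinator design.
   Ok (value, new_pos) on success; Error pos_reached on failure. *)
type 'a parser = string -> int -> ('a * int, int) result

let char (c : char) : char parser = fun s pos ->
  if pos < String.length s && s.[pos] = c then Ok (c, pos + 1) else Error pos

let return (x : 'a) : 'a parser = fun _ pos -> Ok (x, pos)

let ( >>= ) (p : 'a parser) (f : 'a -> 'b parser) : 'b parser = fun s pos ->
  match p s pos with
  | Ok (x, pos') -> f x s pos'
  | Error e -> Error e

(* Alternative: [q] is tried only if [p] failed without consuming input. *)
let ( <|> ) (p : 'a parser) (q : 'a parser) : 'a parser = fun s pos ->
  match p s pos with
  | Ok _ as ok -> ok
  | Error e when e = pos -> q s pos       (* nothing consumed: try q *)
  | Error e -> Error e                    (* consuming failure: committed *)

(* [backtrack p] converts a consuming failure into a non-consuming one. *)
let backtrack (p : 'a parser) : 'a parser = fun s pos ->
  match p s pos with
  | Ok _ as ok -> ok
  | Error _ -> Error pos

(* "ab" and "ac" share the prefix 'a', so the first branch must be made
   backtrackable; otherwise its failure after consuming 'a' commits the
   whole alternative. *)
let ab = char 'a' >>= fun _ -> char 'b' >>= fun _ -> return "ab"
let ac = char 'a' >>= fun _ -> char 'c' >>= fun _ -> return "ac"
let () = assert ((backtrack ab <|> ac) "ac" 0 = Ok ("ac", 2))
```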
You can have a look at pacomb (available on opam & github). It uses parser combinators that run in parallel in some sense (using CPS style and a “scheduler”) and therefore do not backtrack in the usual sense. Pacomb is not really intended to be used directly, though, but rather as a target to compile a BNF to.
More generally, with OCaml 5.0 and effects, you could probably produce very efficient combinators that parse your input in parallel.
Here is the comment explaining our combinators, from the file comb.ml:
Combinators are a standard approach to parsing in functional languages. The major advantage of combinators is that they allow manipulating grammars as first-class values. However, they generally suffer from two major defects. We start this file with a global description of the original features of this library.
- Incomplete semantics. A grammar "(a|b)c" may fail to backtrack and try "bc" if parsing for "ac" fails in "c". This is traditionally solved with continuations: combinators must be given the function that will be used to parse the remaining input. In general, a parsing combinator returning a value of type ['a] with continuations will have type [env -> (env -> 'a -> bot) -> bot], where [env] is the type maintaining the data necessary for parsing (like the current input stream) and [bot] is the type of [false]. ['a cont = 'a -> env -> bot] is therefore the continuation type.
- Exponential semantics. The parsing problem for context-free grammars can be solved in polynomial time (O(n³) implementations are often proposed). As combinators backtrack, they usually lead to exponential behaviour. This is solved here by a [cache] combinator, which avoids parsing the same part of the input twice with the same grammar.
- Backtracking is also a problem, because we need to go back in the input to try other alternatives. This means that the whole input must remain in memory. This is solved by terminals returning immediately instead of calling the continuation. A "scheduler" manages the continuations and the alternative branches for parsing. This means that instead of [bot], we use a type ['a res] with two constructors, one for continuations and one for alternative parsing. A global queue stored in all environments is maintained by the scheduler, and the next action is taken from the top of the queue by the scheduler. The ordering in the queue is a lexicographic ordering on the input position and definition dependence (if A calls B, the continuation of B must be called before the continuation of A). The former ensures that all terminals are parsed in parallel, and therefore that the beginning of the input can be collected by the GC. The latter is necessary for the [cache] combinator to work.
- In many techniques that cover ambiguous grammars, a right-recursive grammar will try to compute the action for every accepted prefix of the input, often leading to quadratic parsing time. This is solved by delaying the evaluation of the semantics, but not by too much, so that the user can still call the [give_up] function to reject some parses from the action code. More generally, to ensure O(1) cost for most steps of parsing, we use two specific types to represent the values returned by parsing and the values transforming them.
- Specific combinators (lr, lr_pos and mlr) are provided to transform left-recursive grammars, which loop with combinators, into non-left-recursive ones.
- A blank function is used (and it can be changed during parsing) to compensate for the scannerless nature of combinators and to deal with blank characters.
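To make the continuation-passing idea in the first bullet concrete, here is a heavily simplified, self-contained sketch (not pacomb's code): [bot] is approximated by a fixed answer type bool and failure by an exception, which is enough to show why a failure in the continuation lets the alternative try its other branch.

```ocaml
(* Simplified continuation-passing combinators: a combinator receives the
   parsing environment and the continuation that parses the rest of the
   input; failure unwinds back to the nearest alternative. *)
exception Fail

type env = { input : string; pos : int }

type 'a comb = env -> (env -> 'a -> bool) -> bool

let lit (w : string) : string comb = fun env k ->
  let n = String.length w in
  if env.pos + n <= String.length env.input
     && String.sub env.input env.pos n = w
  then k { env with pos = env.pos + n } w
  else raise Fail

let seq (p : 'a comb) (q : 'b comb) : ('a * 'b) comb = fun env k ->
  p env (fun env' x -> q env' (fun env'' y -> k env'' (x, y)))

(* The continuation runs inside the [try]: if the rest of the grammar fails
   after [p] succeeded, control comes back here and [q] is tried. *)
let alt (p : 'a comb) (q : 'a comb) : 'a comb = fun env k ->
  try p env k with Fail -> q env k

(* Grammar ("a" | "ab") "c" on input "abc": the branch "a" succeeds first,
   the continuation "c" then fails at 'b', the failure unwinds into [alt],
   and the branch "ab" is tried -- the behaviour a naive result-returning
   [alt] would miss. *)
let accepted (s : string) : bool =
  let g = seq (alt (lit "a") (lit "ab")) (lit "c") in
  try g { input = s; pos = 0 } (fun env _ -> env.pos = String.length env.input)
  with Fail -> false

let () = assert (accepted "abc")
```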
That’s right. Left recursion is not allowed in parsing expression grammars (PEGs). However, this is usually not a problem: left factoring is the solution.
Elimination of left recursion often introduces issues with stack size in the action code. It is easier and more direct to translate a left-recursive BNF using a specific fixpoint combinator dedicated to left recursion, as sketched below.
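For illustration, here is a small self-contained sketch of such a fixpoint (not pacomb's actual lr combinator, just the idea): parse the non-recursive head once, then fold the left-recursive tail, which keeps left associativity without looping.

```ocaml
(* A left-recursive rule  e ::= e "-" digit | digit  loops with naive
   combinators; the [lr] fixpoint below parses the head and then iterates
   the tail, combining left-to-right. *)
type 'a parser = string -> int -> ('a * int) option

let digit : int parser = fun s pos ->
  if pos < String.length s && s.[pos] >= '0' && s.[pos] <= '9'
  then Some (Char.code s.[pos] - Char.code '0', pos + 1)
  else None

let token (c : char) : unit parser = fun s pos ->
  if pos < String.length s && s.[pos] = c then Some ((), pos + 1) else None

(* [lr head step]: parse [head] once, then repeatedly parse a tail with
   [step], threading the accumulated value -- the translation of
   e ::= e tail | head. *)
let lr (head : 'a parser) (step : 'a -> 'a parser) : 'a parser = fun s pos ->
  match head s pos with
  | None -> None
  | Some (x, pos') ->
      let rec loop acc pos =
        match step acc s pos with
        | Some (acc', pos') -> loop acc' pos'
        | None -> Some (acc, pos)
      in
      loop x pos'

(* e ::= e "-" digit | digit, left associative: "9-3-2" evaluates to 4. *)
let expr : int parser =
  lr digit (fun acc s pos ->
    match token '-' s pos with
    | None -> None
    | Some ((), pos') ->
        (match digit s pos' with
         | None -> None
         | Some (d, pos'') -> Some (acc - d, pos'')))

let () = assert (expr "9-3-2" 0 = Some (4, 5))
```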