[ANN] Parseff: parser combinator library for OCaml 5

Hi everyone,

I’m sharing Parseff, a parser combinator library for OCaml 5.

If you like parser combinators but don’t love writing everything in monadic style, Parseff is built for that. Parsers are plain functions (unit -> 'a) and you compose them in direct style, while Parseff handles backtracking and streaming under the hood (using effects!).
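To make “direct style” concrete, here’s a tiny stand-in in plain OCaml (no Parseff at all, just a thunk-based parser over mutable state) showing how sequencing becomes ordinary let-bindings:

```ocaml
(* Plain-OCaml stand-in for the "parsers are plain functions" idea:
   a parser of 'a is just unit -> 'a over shared mutable state, so
   sequencing is ordinary OCaml code, no >>= or let* needed. *)
let input = "42+1"
let pos = ref 0

let digits () =
  let start = !pos in
  while !pos < String.length input
        && input.[!pos] >= '0' && input.[!pos] <= '9' do
    incr pos
  done;
  int_of_string (String.sub input start (!pos - start))

let expect c =
  if input.[!pos] = c then incr pos else failwith "unexpected char"

let expr () =
  let a = digits () in  (* direct style: just a let-binding *)
  expect '+';
  let b = digits () in
  a + b

let () = assert (expr () = 43)
```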

Why Parseff

  • Direct-style API: write sequential OCaml code, no >>=, let*, or applicative operator chains required.
  • Typed errors: raise domain errors with polymorphic variants via Parseff.error.
  • Streaming input: same parser works with Parseff.parse (string) and Parseff.parse_source (stream).
  • Zero-copy + fused ops: span APIs and fused operations for hot parsing paths.
  • Domain-safe execution: parse calls are self-contained (no global mutable state).
  • Automatic backtracking with Parseff.or_
  • Minimal dependency footprint: only re for regex

Performance

In the included benchmarks, Parseff is faster than Angstrom and MParser:

  • ~2x in fair comparisons
  • up to ~4x with zero-copy optimized paths

Benchmark details and code are in the repo docs.

Documentation

I’ve put a lot of effort into the docs (quick start, guides, and API pages) to keep Parseff easy to learn. I’m a bit proud of the setup: with odoc’s Markdown backend plus mdx, I can generate and check examples on the fly, so snippets stay type-safe.

https://davesnx.github.io/parseff

Install

opam install parseff -y

Links

If you try it, I’d really value feedback on API ergonomics, the error-handling experience, missing combinators, or docs gaps!

Thanks!

35 Likes

That looks pretty cool! I was quickly browsing through the documentation, and it is impressive so far.

For the performance, the quick start page links to https://davesnx.github.io/parseff/bench/bench_vs_angstrom.ml, which gives a 404. I guess it was meant to be parseff/bench/bench_vs_angstrom.ml at main · davesnx/parseff · GitHub, which is on some other page.

As for the benchmark, you get quite impressive results! That’s got me excited; I am planning to try your library. I have a project where I am trying to match C code speed, and I was comparing different parsing options.

As you mention closures in Angstrom as a source of overhead, I was wondering if flambda could reduce the gap a bit.

1 Like

Thanks. Fixed the link.

Performance was a great surprise. Effects have the “stack switching” cost that penalizes a bit, since it makes the program jump into another memory region. But we also win a lot with the backtracking in Parseff.or_, where it can simply resume instead of “replay” or any other mechanism.
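For the curious, the resume-instead-of-replay idea can be sketched with OCaml 5’s stdlib effects alone. This is a toy model, not Parseff’s actual implementation:

```ocaml
open Effect
open Effect.Deep

(* The single effect a parser branch can perform: abort with a message. *)
type _ Effect.t += Fail : string -> 'a Effect.t

type state = { input : string; mutable pos : int }

let char st c =
  if st.pos < String.length st.input && st.input.[st.pos] = c then begin
    st.pos <- st.pos + 1; c
  end
  else perform (Fail (Printf.sprintf "expected %c" c))

(* or_: run [left]; if it performs Fail, drop its continuation, rewind
   the cursor and run [right].  No re-scanning of consumed input. *)
let or_ st left right =
  let saved = st.pos in
  match_with left ()
    { retc = Fun.id;
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Fail _ ->
            Some (fun (_k : (a, _) continuation) ->
              (* a real library would discontinue _k to run cleanup *)
              st.pos <- saved;
              right ())
        | _ -> None) }

(* Top-level runner turning an unhandled Fail into an Error. *)
let parse input p =
  let st = { input; pos = 0 } in
  match_with (fun () -> p st) ()
    { retc = (fun v -> Ok v);
      exnc = raise;
      effc = (fun (type a) (eff : a Effect.t) ->
        match eff with
        | Fail msg -> Some (fun (_k : (a, _) continuation) -> Error msg)
        | _ -> None) }

let () =
  let p st = or_ st (fun () -> char st 't') (fun () -> char st 'f') in
  assert (parse "f" p = Ok 'f');
  assert (parse "x" p = Error "expected f")
```

Note that when the left branch fails, its continuation is simply abandoned and the right branch runs from the saved position, which is the “resume, not replay” behavior described above.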

I wanted to build a version with oxcaml and see where it could improve, but unrelated to the performance, I also wrote parseff because angstrom has a few issues that have been silent for a while.

1 Like

Hello,

Interesting!

A real pain when writing parsers is handling locations with line numbers and character positions on the line. The Parseff interface seems to only provide a position from the beginning of the stream. Do you think it would be possible to provide more convenient positions, with the absolute position but also the line number and the position on the line?

Another question: do you think it would be possible to provide primitives to parse Uchar.t instead of char?

Regards.

1 Like

Great job! Do you recommend any resource about parsers or domain specific languages, within the context of OCaml or Functional Programming?

Interesting, I have been working on a parser combinator library too, and at some point I considered using effects for every rewind as you do, but I gave up because of the typing limitations (I’m trying to make the library more generic than angstrom or yours in various ways). I would never have thought the performance benefits were so big! I do wonder how type safety works out in practice, though.

In the end I only use effects for suspending the state, which allows me to write the combinators in semi-direct style (not CPS, I do have continuation arguments but just to tell the parser what to return not to recurse), but the interface is still monadic. In my experiments this approach only has a small performance advantage over the pure CPS style of angstrom (but then I lose that advantage when I add my fancy type stuff).

1 Like

Yes, I should add Uchar.

With Parseff.positions you get the relative position each time you call it, and in combination with diagnostics you can track absolute positions or line numbers: Diagnostics | Parseff. If you have any specific suggestion, I’m open to considering it.
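If it helps, here is roughly what deriving line/column from an absolute offset looks like. This is plain OCaml, not a Parseff API:

```ocaml
(* Derive a 1-based (line, column) pair from an absolute byte offset,
   the usual fallback when a parser only reports stream positions. *)
let line_col input offset =
  let line = ref 1 and bol = ref 0 in
  String.iteri
    (fun i c -> if i < offset && c = '\n' then (incr line; bol := i + 1))
    input;
  (!line, offset - !bol + 1)

let () =
  assert (line_col "ab\ncd" 0 = (1, 1));
  assert (line_col "ab\ncd" 3 = (2, 1));
  assert (line_col "ab\ncd" 4 = (2, 2))
```

A real implementation would track the beginning-of-line incrementally rather than rescanning the input, but the mapping itself is this simple.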

Thanks

I wrote 5 guides about parsing that should work as a tutorial (they are about parseff, so ideally you’d have some basic notions of parsing first rather than learning from a single lib, but they can still be worthwhile)

5 Likes

That’s cool. I don’t think effects are “super performant” per se; rather, Parseff has had a bit more care put into it than angstrom. Also, I added some fused operations and spans that help reduce wasted iterations.

Would be nice to see another design, I encourage you to finish and publish!

Angstrom has support for parsing binary input, like multi-byte numbers in different endianness. Does the library support this use case, or is support planned?

Thinking about it a bit more, let’s see if a richer location makes sense. I created an issue: Better position tracking · Issue #2 · davesnx/parseff · GitHub

Uchar and binary support could be added.
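For reference, this is the kind of primitive binary support boils down to; the stdlib already has the decoding building blocks (since OCaml 4.13), so a sketch of what an API would wrap:

```ocaml
(* Decoding fixed-width integers at an offset, in a chosen endianness,
   using the stdlib getters available since OCaml 4.13. *)
let u16_le s off = String.get_uint16_le s off
let i32_be s off = String.get_int32_be s off

let () =
  assert (u16_le "\x01\x02" 0 = 0x0201);
  assert (i32_be "\x00\x00\x01\x00" 0 = 256l)
```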

EDIT: Parseff.Utf8 and Parseff.LE/BE have been added in v0.3.0: binary: BE, LE, fail does `Failure, catch, locations, location_of_position by davesnx · Pull Request #5 · davesnx/parseff · GitHub
with a few more missing bits; look into the 0.3.0 tag (Release 0.3.0 · davesnx/parseff · GitHub) and open an issue with any feedback or suggestions.

Thanks!

2 Likes

The one thing I wasn’t a fan of in Parseff.or_ is the error reporting (the backtracking and everything else is pretty nice though!):

Since we apply the right parser last, its error is the only one we capture, giving a terribly confusing error message; see this screenshot from the docs (the maybe example reports that it expected false).

This is technically a missing feature: at the time we are discontinuing the effect (which jumps into the main error handler) we know both the left and the right parser positions, so we could join the errors.

That’s what’s done here: Compose Expected errors · davesnx/parseff@caa0973 · GitHub, and the expected error on “maybe” becomes (* Error { pos = 0; error = Expected "\"true\" or \"false\"" } *)
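The joining itself is simple; schematically (the types here are a sketch, loosely following the Error { pos; error } shape above, not the actual commit):

```ocaml
(* Join two branch failures: keep the error that got furthest, and
   merge the "expected" alternatives when both failed at the same spot. *)
type error = { pos : int; expected : string list }

let join l r =
  if l.pos > r.pos then l
  else if r.pos > l.pos then r
  else { pos = l.pos; expected = l.expected @ r.expected }

let () =
  let l = { pos = 0; expected = ["\"true\""] }
  and r = { pos = 0; expected = ["\"false\""] } in
  assert (join l r = { pos = 0; expected = ["\"true\""; "\"false\""] })
```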

I will push it as part of 0.3.0

3 Likes

Just out of curiosity, how much more generic is your library compared to angstrom and parseff?

I’m also making a parser combinator library, but the interface / set of combinators are mostly based on angstrom, it’s just the driver / internal “engine” that is different (specifically, I’m reimplementing ocaml-asp as an exercise). I feel like the monad Angstrom.t and the operators / combinators around them are quite natural to use already.

I’ve been trying to support arbitrary character types and buffer/string types without losing too much performance. In particular, you can have a UTF-8 “lexer”, and then combinators like take_while take a Uchar.t -> bool argument. Or you can parse an array of tokens of some custom type.
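The token-generic part can be sketched as a functor over the token/buffer pair (my actual code is more involved, but this is the idea):

```ocaml
(* Abstract the "lexer" into a module parameter so the same combinator
   works over chars in a string, Uchar.t in a decoded buffer, or a
   custom token array. *)
module type TOKEN = sig
  type t
  type input
  val length : input -> int
  val get : input -> int -> t
end

module Make (T : TOKEN) = struct
  (* take_while over any token type: returns the matched [start, stop) span *)
  let take_while (p : T.t -> bool) (inp : T.input) pos =
    let n = T.length inp in
    let i = ref pos in
    while !i < n && p (T.get inp !i) do incr i done;
    (pos, !i)
end

module Chars = Make (struct
  type t = char
  type input = string
  let length = String.length
  let get = String.get
end)

let () = assert (Chars.take_while (fun c -> c = 'a') "aab" 0 = (0, 2))
```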

I’m not sure there is that much of a use for all that but it was more for the challenge of implementing it. I will post it soon :slight_smile:

1 Like

Interesting! For parser combinators the “smallest unit of work” is usually a character (that’s what you get with take_while, peek_char, etc), but with this idea, it’s possible to define your own unit, your own level of abstraction? That would be useful for not only supporting multiple character types as you suggested but for mental modeling as a whole.

Looking forward to your library!

I don’t know what you mean by generic.

Personally I think angstrom’s combinators are OK, but after writing 2 parsers at work, there’s clearly a barrier to understanding what the parser is doing, plus a learning wall with the operators.

I designed parseff to be a library where you come back to parsing logic you wrote a year ago and instantly understand everything. That’s not the case for angstrom.

3 Likes

Oh btw, you might also like the ( or ) operator. Eg,

type 'a t = unit -> 'a

val ( or ) : 'a t -> 'a t -> 'a t

Usage:

parser1 or parser2
1 Like

Very cool!

I looked at bench/bench_angstrom.ml and made some modifications to the “fair” benchmark so that it is, IMO, closer to the angstrom benchmark: convert numbers to float while parsing; factor out whitespace skipping; use the unfused sep_by over sep_by_take_span; use the same character predicates (probably insignificant).

diff --git a/bench/bench_angstrom.ml b/bench/bench_angstrom.ml
index 0c26b32..f8c5104 100644
--- a/bench/bench_angstrom.ml
+++ b/bench/bench_angstrom.ml
@@ -1,8 +1,8 @@
 open Benchmark
 
 let json_input = {|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]|}
-let is_digit_or_sign c = (c >= '0' && c <= '9') || c = '-' || c = '.'
-let is_ws c = c = ' ' || c = '\t' || c = '\n' || c = '\r'
+let is_digit_or_sign = function '0' .. '9' | '-' | '.' -> true | _ -> false
+let is_ws = function ' ' | '\t' | '\n' | '\r' -> true | _ -> false
 
 module Parseff_JSON = struct
   let[@inline always] float_of_span (s : Parseff.span) =
@@ -16,9 +16,6 @@ module Parseff_JSON = struct
     else
       float_of_string (Parseff.span_to_string s)
 
-  let[@inline] float_of_span_fair (s : Parseff.span) =
-    float_of_string (Parseff.span_to_string s)
-
   let json_array () =
     Parseff.skip_while_then_char is_ws '[';
     Parseff.skip_while is_ws;
@@ -27,11 +24,17 @@ module Parseff_JSON = struct
     Parseff.skip_while_then_char is_ws ']';
     elements
 
+  let number () =
+    let s = Parseff.take_while is_digit_or_sign in
+    float_of_string s
+
+  let ws () = Parseff.skip_while is_ws
+
   let json_array_fair () =
     Parseff.skip_while_then_char is_ws '[';
-    Parseff.skip_while is_ws;
-    let spans = Parseff.sep_by_take_span is_ws ',' is_digit_or_sign in
-    let elements = List.map float_of_span_fair spans in
+    ws ();
+    let sep () = ws (); ignore (Parseff.char ','); ws () in
+    let elements = Parseff.sep_by number sep () in
     Parseff.skip_while_then_char is_ws ']';
     elements
 
@@ -53,10 +56,10 @@ end
 module Angstrom_JSON = struct
   open Angstrom
 
-  let ws = skip_while (function ' ' | '\t' | '\n' | '\r' -> true | _ -> false)
+  let ws = skip_while is_ws
 
   let number =
-    take_while1 (function '0' .. '9' | '-' | '.' -> true | _ -> false)
+    take_while1 is_digit_or_sign
     >>| float_of_string
 
   let json_array =

I found that I missed the _ *> _ combinator from angstrom a bit - I had to make a local binding for the separator. I also had to be a bit careful where to put (). Overall not a big deal.
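With thunk parsers, the missing _ *> _ is easy to define locally. A sketch (hypothetical helper, not part of Parseff):

```ocaml
(* Sequencing operators over thunk parsers: run both, keep one result.
   Mirrors angstrom's _ *> _ and _ <* _ for the unit -> 'a shape. *)
type 'a t = unit -> 'a

let ( *> ) (p : _ t) (q : 'a t) : 'a t =
 fun () -> ignore (p ()); q ()

let ( <* ) (p : 'a t) (q : _ t) : 'a t =
 fun () -> let v = p () in ignore (q ()); v

let () =
  let a () = 1 and b () = "x" in
  assert ((a *> b) () = "x");
  assert ((b <* a) () = "x")
```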

With that I found that angstrom and parseff were more comparable in the “fair” case:

Benchmarking Parseff vs Angstrom   
==================================

Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Latencies for 100000 iterations of "Parseff (span)", "Parseff (fair)", "Angstrom" (3 runs):
Parseff (span):  0.03 WALL ( 0.03 usr +  0.00 sys =  0.03 CPU, minor = 185.60 MB, major = 0 B) @ 3268080.66/s (n=100000)
                (warning: too few iterations for a reliable count)
                 0.03 WALL ( 0.03 usr +  0.00 sys =  0.03 CPU, minor = 185.60 MB, major = 0 B) @ 3280947.54/s (n=100000)
                (warning: too few iterations for a reliable count)
                 0.03 WALL ( 0.03 usr +  0.00 sys =  0.03 CPU, minor = 185.60 MB, major = 0 B) @ 3279441.18/s (n=100000)
                (warning: too few iterations for a reliable count)
Parseff (fair):  0.17 WALL ( 0.17 usr +  0.00 sys =  0.17 CPU, minor = 564.80 MB, major = 0 B) @ 594643.45/s (n=100000)
                (warning: too few iterations for a reliable count)
                 0.17 WALL ( 0.17 usr +  0.00 sys =  0.17 CPU, minor = 564.80 MB, major = 0 B) @ 594915.85/s (n=100000)
                (warning: too few iterations for a reliable count)
                 0.17 WALL ( 0.17 usr +  0.00 sys =  0.17 CPU, minor = 564.80 MB, major = 0 B) @ 593884.18/s (n=100000)
                (warning: too few iterations for a reliable count)
      Angstrom:  0.14 WALL ( 0.14 usr +  0.00 sys =  0.14 CPU, minor = 584.00 MB, major = 0 B) @ 710767.42/s (n=100000)
                (warning: too few iterations for a reliable count)
                 0.14 WALL ( 0.14 usr +  0.00 sys =  0.14 CPU, minor = 584.00 MB, major = 0 B) @ 712565.38/s (n=100000)
                (warning: too few iterations for a reliable count)
                 0.14 WALL ( 0.14 usr +  0.00 sys =  0.14 CPU, minor = 584.00 MB, major = 0 B) @ 719393.41/s (n=100000)
                (warning: too few iterations for a reliable count)

                    Rate       Parseff (fair)       Angstrom Parseff (span)
Parseff (fair)  594481+- 415/s             --           -17%           -82%
      Angstrom  714242+-3530/s            20%             --           -78%
Parseff (span) 3276156+-5456/s           451%           359%             --

Note: Lower latency is better.

So the “fair” implementation is now a bit slower than angstrom, but not by that much. The memory usage is also comparable to angstrom. Still very cool because you have access to the fused stuff if you need to tweak for performance.

2 Likes

After having tried some parser combinator libraries, I think there should be multiple styles of writing and using combinators. For example, in Parsus, a parser combinator library for Kotlin, one can write and use combinators in either “combinator style” or “procedural style”. In Angstrom you can do something similar: you have the choice to use let+ bindings, or you can use bind operators like >>=. Personally, I usually write only small hobby code bases, and I really like using the operators. But I imagine that as my grammar increases in complexity and as I maintain more and more parsers, there will be a growing need for a simpler, “scalable” syntax.
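The two styles can be contrasted on a toy parser monad (hypothetical types, not Angstrom’s or Parsus’s actual API):

```ocaml
(* Toy parser monad: a parser of 'a takes the input and a position and
   returns the value plus the new position, or None on failure. *)
type 'a p = string -> int -> ('a * int) option

let ( >>= ) (m : 'a p) (f : 'a -> 'b p) : 'b p =
 fun s pos ->
  match m s pos with
  | Some (v, pos') -> f v s pos'
  | None -> None

let ( let* ) = ( >>= )

let char c : char p =
 fun s pos ->
  if pos < String.length s && s.[pos] = c then Some (c, pos + 1) else None

(* operator style *)
let ab = char 'a' >>= fun _ -> char 'b'

(* binding style: the same parser, reading more like sequential code *)
let ab' =
  let* _ = char 'a' in
  char 'b'

let () =
  assert (ab "ab" 0 = Some ('b', 2));
  assert (ab' "ab" 0 = Some ('b', 2));
  assert (ab "ax" 0 = None)
```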