High-performance lexing in OCaml

I’m in the process of writing a parser for a YAML-like language that extends JSON and is specified in two layers: a lexical layer and a grammatical layer. The current implementation uses sedlex + camlp5 extensible grammars. Obviously performance is really important for this sort of thing: I used these particular libraries b/c they were available and good for programmer productivity, and am under no illusions that they’re the ne plus ultra of performance.

So I’m thinking about how to write a high-performance version that would do as much as possible to avoid memory allocation, copies, etc. I already know that I can switch to recursive descent for the grammar (initially using Camlp5 stream-parsers), but I’m thinking about ways to do better for the lexer:

  1. Probably switch to ocamllex, so that instead of converting a UTF-8 stream to UCS-2 glyphs in an array, and then converting back, we just keep things in UTF-8. This seems like it should be a win.
  2. Yojson has some tricks to avoid allocations, and I’m going to try to figure out how to benefit from those.

But I wondered if others had ideas for how to wring the absolute most performance out of ocamllex. B/c after that, the next step is to figure out how to write a lower-level lexer, and … that seems like a much more … fraught exercise. I’ve wondered if maybe there’s a way to use regular expressions, maybe somehow hacking the regexp engine to make processing N alternative regexps as efficient as processing a single one. But that’s really just random musing.

Any advice would be greatly appreciated.

I think the yojson hack you mention was found by accident. I don’t think I copied it from somewhere else; rather, I found it by looking at the generated code and guessing that allocating at high frequency would make things slow.

To make parsers fast at the expense of using more memory, tuning the GC settings with Gc.set could work well. Of course it could also hurt the application in which parsing happens. An idea is to provide a function make_faster: unit -> unit, which the user can call to try to speed things up. Here’s what we use in semgrep:

  Gc.set {
    (Gc.get()) with
    Gc.space_overhead = 300;
    Gc.minor_heap_size = 4_000_000;
    Gc.major_heap_increment = 8_000_000
  }

There is a chance that if you move lexing and parsing onto parallel threads, then overall parsing performance will be better.

Yes, I’ve found that while other kinds of performance optimization are important, to a first approximation you can just (just) profile for allocation-rate, and working to remove allocation hotspots will do a ton for perf.

I have a vague memory of comparing perf of ocamllex against pcre at some point, and finding that the latter was faster. But memory escapes me, ah well.

@Kakadu : I’m more looking for ways to decrease instruction-count, not ways to employ parallelism (though, since this is compute-bound, near-100% OCaml code, I doubt that multithreading would speed it up). [By compute-bound, I mean that I’m excluding I/O, assuming that the problem is to parse from already-read memory buffers: the problem of I/O optimization is a separate one.]

P.S. Your point about increasing heap-sizes is well-taken, but I’m eschewing such methods, b/c they can always be applied after other optimizations have been done. I want to find out what can be done just in the code.

The ocamllex -ml option will generate the lexing automaton as OCaml code (instead of a data table interpreted by the runtime). In the past it was not recommended / not the default, as generally the normal mode performed better and was less likely to make the compiler blow up in your face, but since OCaml 4.08 (@alainfrisch’s upstream PR #1585) it has been optimized and is reported to be significantly faster (than the default backend) in some cases; see the benchmarks from the time. I guess you should try it.

Both ocamllex modes have a trick to avoid updating lex_curr_p, which I think corresponds to the yojson hack you mentioned – this trick should not be necessary anymore with recent ocamllex versions. If I understand correctly, if lex_curr_p is set to Lexing.dummy_pos then it will not be updated.
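If that reading is correct, the trick can be applied from user code along these lines (a sketch, assuming the dummy_pos behavior described above; `make_lexbuf` is a made-up helper name):

```ocaml
(* Sketch: build a lexbuf with position tracking disabled, using the
   dummy_pos trick described above.  [make_lexbuf] is a hypothetical name. *)
let make_lexbuf (input : string) : Lexing.lexbuf =
  let lexbuf = Lexing.from_string input in
  (* Recent ocamllex-generated code reportedly skips the per-token
     position bookkeeping when lex_curr_p holds Lexing.dummy_pos. *)
  lexbuf.Lexing.lex_curr_p <- Lexing.dummy_pos;
  lexbuf
```

You would then pass this lexbuf to your generated rule as usual; the tradeoff is losing source positions in error messages.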


Ideally we would have a fast sedlex implementation, so that we would not have to choose between speed and nice Unicode handling. You mention that it converts the input to an intermediate glyph array, but this is probably just a simplification that could be avoided. Is anyone interested in optimizing sedlex? The ocamllex -ml performance demonstrates that generated OCaml code can equal or surpass the default ocamllex backend, so sedlex could do it too!

If the lexical structure of your language is only driven by ASCII characters, you could indeed do the lexical analysis directly on the input byte stream. (And then, if you don’t trust the input to be valid utf-8, check the well-formedness of strings, etc.)
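A minimal sketch of that byte-level approach, assuming the structurally significant characters are all ASCII (`lex_ident` and `is_ident_char` are hypothetical names, and `String.is_valid_utf_8` requires OCaml ≥ 4.14):

```ocaml
(* Sketch: scan tokens directly on the raw byte string, treating only ASCII
   characters as structurally significant; non-ASCII bytes pass through
   untouched, and UTF-8 well-formedness is checked lazily where needed. *)
let is_ident_char c =
  (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  || (c >= '0' && c <= '9') || c = '_'
  || Char.code c >= 0x80  (* any non-ASCII byte is part of the token *)

(* Returns the token starting at [pos] and the position just past it. *)
let lex_ident (s : string) (pos : int) : string * int =
  let n = String.length s in
  let i = ref pos in
  while !i < n && is_ident_char s.[!i] do incr i done;
  (String.sub s pos (!i - pos), !i)

(* Validate only the payloads that matter (OCaml >= 4.14). *)
let check_utf8 (tok : string) : unit =
  if not (String.is_valid_utf_8 tok) then failwith "malformed UTF-8"
```

The appeal is that the hot loop touches each input byte once and allocates only for the tokens it actually extracts.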

How sensible is it to use Angstrom to build the lexer and parser in one formalism? I have used Angstrom to parse a somewhat complex binary protocol, but I can see it being used for any grammar that can be parsed by recursive descent. Yacc grammars (LR) are more expressive, but maybe that expressiveness is not needed.

The examples in Angstrom include a JSON parser that also handles UTF-8.


TL;DR Parser combinators appear to me to be “combinators for LL(1) parsing”. That leaves you all the work of converting a pile of regexps into a backtracking-free combinator-function-nest. And it isn’t at all clear that performance will be better.

I don’t really know. But I doubt it. Here’s why:

  1. the traditional divide between lexing and parsing concentrates the need for backtracking to deal with ambiguity into the lexer. There, it is dealt with by precomputing the lexing automaton, so that at runtime there is no backtracking.
  2. parser-combinators either just use backtracking, or require the user to do the work of converting a collection of regexps that are not LL(1) (in the sense that a single character of input does not suffice to choose which one to go with) into a parser-combinator-nest that is backtracking-free.
  3. A good example that I experienced recently was the Camlp5 “pa_lexer” syntax extension: it is designed to write lexers, but without backtracking and also without the lookahead-without-consuming capability of lex. It’s significantly more complicated to write a complex lexing specification in that language than in plain old lex.
  4. It isn’t clear that performance would be better with parser-combinators, but hey, maybe it would. I searched around for performance comparisons, and what I find are:
  • comparisons for simple examples between combinators and other methods, but not regexps or lex
  • after that, only comparisons between different combinator libraries.
  5. I’ve experimented [on a problem posed here by @rwmjones – it was a problem around parsing a syntax for expressing disk-contents] with solving a nontrivial parsing problem using Camlp5’s stream-parsers, and eventually realized that the fastest, fastest, FASTEST technique (which was also the simplest) was to build a big-ass regexp, hand it to pcre, and let it do its magic. He started with a Perl script, and wanted something faster. When it turned out that almost all the work could be done in a big-ass regexp, it seemed pretty clear to me that while sure, OCaml was nice, if it were me, I’d just implement that regexp in Perl and take the win. B/c typically Perl is pretty fast at string-processing, esp. when most of the work is in regexps.

It’s very difficult to assess whether parser combinators are suitable for actual “industrial strength” parsing, where that is defined as “the incumbent is C++ flex/bison”. Not to speak of more esoteric approaches in C/C++.

Overall, from what I can see, parser combinators are just recursive-descent parsing dressed up in combinators. That’s not bad: I’m a ginormous fan of Camlp5’s stream parsers. But I wouldn’t use them for writing a lexer, b/c boy howdy, that’s unpleasant work compared to writing a pile of regexps and stepping back while the lex automaton algorithm does its work.

P.S. I hope it’s clear that I’m not saying parser-combinators aren’t useful. I use them (in the form of Camlp5 stream-parsers) all the time for parsing binary protocols, e.g. Java classfile format, CORBA IIOP, etc. A distinguishing feature of RPC protocols is that they are all LL(1) languages at the character level. So parser-combinators (or Camlp5 stream parsers) are perfect.

If the grammar you have in mind extends JSON, then you should be able to do a reasonably representative comparison by just parsing JSON inputs. For example you could compare what you already have with angstrom’s JSON parser, both in terms of performance and code clarity, and possibly also experiment with other parsing techniques (or local changes like replacing sedlex by ocamllex) just on the JSON fragment.


The “language” is YAML, but with some changes to allow precise lexing/parsing (for those who have some familiarity with the YAML spec, it shouldn’t be surprising to learn that it’s pretty difficult to write a YAML parser using lex/yacc in the typical way: this results in basically every YAML parser differing in spec-compliance from every other one). So while this language has JSON as a subset, it’s quite, quite different b/c there’s another subset that is indentation-sensitive, etc, etc.

I will look into the benchmarks you mention that show lexing in ML code to be faster than doing it with the C engine: that should be a good start. At the same time, I guess it would be useful to look into PCRE, since that’s been so … shockingly fast in the past.

P.S. Purely for amusement’s sake, here’s the repo for the implementation. It isn’t complete, and currently relies on Camlp5’s extensible grammars (but that’ll go away in the interest of performance). The “specification” is incomplete, but I’m working on it.

I am currently working on an update of menhir with the goal of improving the generated code, including its performance. I have no idea how it compares to other parsers though. If at some point you make a benchmark, I would really like to compare different versions of menhir to any other option you think could yield good results.


It’s a funny thing: I don’t know much about the variation in performance of parser generators or parsing algorithms. I’ve heard that some are faster than others, but have always thought that lexing is much more determinative of performance, b/c lexing involves many branches and loads for each single parsing-automaton action. I guess I could be wrong, and need to look into it.

Do you have any pointers to people who have done performance studies of parsing methods and generated parsers? I suppose it makes sense to compare them in isolation (somehow synthetically generating the lexemes, so lexer efficiency doesn’t affect parsing), and then in tandem with lexing?

Is this making sense? We all know that subfields of software can be quite specialized, so perhaps I’m just off-base, and in fact parsing automaton/algorithm efficiency is really important.

P.S. Setting aside, of course, the case where something about the lexemes makes comparing their classes less efficient. That could affect parser efficiency, for sure.

I’m surprised by this hypothetical performance discussion :-)

It seems you already have an implementation. Did you actually measure, to try to identify a bottleneck? I can’t see that mentioned in your first message.


ocamllex and sedlex (and any serious lexer generator) already do that. You can even do it with re. :-)

Lexing is really not the bottleneck in lexing+parsing. You can think about it this way: the more complicated the grammar you are trying to recognize, the slower it is. Lexing recognizes regular grammars, which are the simplest and most heavily optimized class of grammar that is commonly analyzed. Parsing afterwards is trying to recognize something more complicated, and is often slower by several orders of magnitude.

Also, I would be (pleasantly) surprised if camlp5 stream parsers were not horribly slow compared to menhir, or any other parser generator.

Heh indeed. I have an implementation using sedlex and Camlp5 “extensible grammars” (which are interpreted). I would guess there is inefficiency in sedlex as compared with ocamllex, but don’t really know. For sure Camlp5 extensible grammars are less efficient than Camlp5 stream parsers, and I’m not going to be surprised if those are less efficient than menhir. I would call this implementation a “reference parser”, designed to ensure that I have something perspicuous to go along with the specification.

So it’s fair to say that I’m asking these questions, in preparation for starting to write efficient implementations.

P.S. Though, I got sidetracked into implementing a “jq” interpreter. That’s a fascinating language (GitHub - stedolan/jq: Command-line JSON processor): it’s a functional language for manipulating JSON objects, whose critical datatype is lazy lists. The guy who did it wrote it all in C, generating bytecode. Very impressive bit of work there. I’m just writing an interpreter in OCaml, which is … a whole lot less, ha. After I’m done, I’m going to write a compiler, b/c “why not? It’s not much more work!”

I think you can address your admiration to @stedolan personally. :-)

I do not really know (yet?) whether performance studies exist for parser generators. I am aware of this paper from the 90s, and that is pretty much all: http://webhome.cs.uvic.ca/~nigelh/Publications/fastparse.pdf

However, menhir already has benchmarks, and yes, we perform the lexing first, put all the lexemes in an array, and create a fake lexer that just accesses the array. I am not really sure how not doing that would impact performance, but I could check quite easily.
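That token-array setup can be sketched as follows (hedged: `pre_lex`, `fake_lexer`, and `is_eof` are made-up names, and menhir’s actual benchmark harness presumably differs in details):

```ocaml
(* Sketch: run the real lexer once, store all tokens in an array, then hand
   the parser a "fake lexer" that just replays the array.  [is_eof] decides
   when the real lexer is exhausted. *)
let pre_lex (real_lexer : Lexing.lexbuf -> 'tok) (is_eof : 'tok -> bool)
    (lexbuf : Lexing.lexbuf) : 'tok array =
  let toks = ref [] in
  let rec loop () =
    let t = real_lexer lexbuf in
    toks := t :: !toks;
    if not (is_eof t) then loop ()
  in
  loop ();
  Array.of_list (List.rev !toks)

(* The fake lexer ignores its lexbuf argument, matching the
   [Lexing.lexbuf -> 'tok] shape that menhir-generated parsers expect. *)
let fake_lexer (toks : 'tok array) : Lexing.lexbuf -> 'tok =
  let i = ref 0 in
  fun _lexbuf ->
    let t = toks.(!i) in
    incr i;
    t
```

This isolates the parser’s cost in the benchmark, since the replay loop is just an array read and an increment.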

I was about to suggest that instead of your doing that experiment, I would go do it myself. B/c hey, good practice! But grepping around in the menhir repo, I don’t find anything that looks like a benchmark.

If you could point me at the code and how to run it, I’d be happy to give a try at doing what you described. Or, at some point I’ll write a Menhir parser (have an ocamllex lexer) and I’ll be able to do the experiment myself.

The makefile rule to build the benchmark is “make speed”. I am not sure whether this is present in master, but it is for sure in the branch “new-code-backend-experiments”. The code is in “tests/dynamic/speed” and “tests/dynamic/speed_houblix” (the first one is a benchmark on integer arithmetic expressions, the second one on a more complex ML-ish language).