Lwt core refactored and documented to be contributor-friendly

I’m answering with a pretty thorough look at how variants are implemented. Hopefully, it’s still accessible :slight_smile:

The short version is that matching polymorphic variants is, surprisingly, faster. Allocating them is slower. But the differences are slight. They don’t matter to Lwt. They probably don’t matter to most other projects, either.


What does this have to do with GADTs? At runtime, GADTs are the same as ordinary variants. So, I’ll talk about ordinary variants instead. I’ll also assume a 64-bit platform, so words are 64 bits long.


I’ll first describe how ordinary variants are represented and pattern-matched. Then, I’ll describe the relevant differences of polymorphic variants, and measure them. To preview, those differences are:

  1. Polymorphic variants are represented as sparse integer ranges, instead of dense ones.
  2. Slightly more memory is allocated for polymorphic variants.

I did this in part for my own education. It’s a recurring question, though, so I decided to write it up :slight_smile: I put commented assembly and benchmark source code into a gist. There is also some BuckleScript output.


How ordinary variants are represented in memory

Let’s look at this variant type:

type ordinary =
  | A | B | C | D
  | E of int | F of int | G of int | H of int

OCaml separates its constructors into two lists: constructors without arguments (A, B, C, D), and constructors with arguments (E, F, G, H). The constructors in each list are numbered starting from zero. Names don’t matter.

A -> 0   B -> 1   C -> 2   D -> 3   (* no arguments *)
E -> 0   F -> 1   G -> 2   H -> 3   (* take arguments *)

Constructors without arguments, like D, become unboxed integers, which are single machine words that fit into a register.

Aside about OCaml integers:

The lowest bit of an unboxed integer is set to 1. The actual value is stored in the remaining 63 upper bits. This distinguishes integers from pointers, whose lowest bit is 0. The GC needs this distinction: it is part of how the GC knows which words are pointers that it has to follow.

So, the integer 3 is represented as (3 << 1) | 1, which is 7.

With this aside in mind, the no-argument constructors’ final machine representations are 1, 3, 5, and 7.
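
You can poke at this from within OCaml using the unsafe Obj module (a quick sanity check only, not something for real code; note that the 2n+1 tagging lives below the level Obj.magic exposes, so D shows up as a plain 3):

let () =
  (* D is an unboxed integer; E 0 is a pointer to a block. *)
  assert (Obj.is_int (Obj.repr D));
  assert (not (Obj.is_int (Obj.repr (E 0))));
  (* Obj.magic reveals D's constructor number. *)
  assert ((Obj.magic D : int) = 3)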


Constructors with arguments, like G 10, are represented as pointers to memory blocks (so the lowest bit will be 0). Following the pointer leads to a block like this:

[ block header ] [      21      ]

21 is the OCaml representation of 10. The block header contains information like the size of the block. It also contains a few bits for a tag field. In this case, the tag will be 2, the number for G.

If the constructor has two arguments, there will be three words in the block:

[ block header ] [      21      ] [      85      ]
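
The block layout can be checked the same way with Obj (again just a sanity check; the two-argument constructor below is hypothetical, since ordinary has none, and 85 above is the representation of 42):

type pair = P of int * int

let () =
  let g = Obj.repr (G 10) in
  assert (Obj.tag g = 2);                        (* G's constructor number *)
  assert (Obj.size g = 1);                       (* one argument word *)
  assert ((Obj.obj (Obj.field g 0) : int) = 10);
  assert (Obj.size (Obj.repr (P (10, 42))) = 2)  (* two argument words *)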

Matching ordinary variants

So, to perform a match, the generated code has to:

  1. Determine whether it is looking at an unboxed integer (a constructor with no arguments), or a pointer (a constructor with arguments). This is done by examining the lowest bit of the word.
  2. If an unboxed integer, pick a case according to the high 63 bits.
  3. If a pointer, dereference, extract the constructor number from the tag field in the block header, and use that to pick the case.

What’s important is that in both (2) and (3), there is a dense integer range, from zero to the number of constructors of that kind. So, the match case can be picked using a jump table: pointers to the code for each case are stored in an array. The constructor number is used as an index into that array. The processor jumps to the pointer stored at the proper index. This jump-by-computed pointer is called an indirect jump. Because of the random-access table lookup, the performance of the jump table is largely unaffected by the number of constructors.
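
As a concrete (made-up) example over the ordinary type above, a function like this compiles to exactly that two-step dispatch: a low-bit test, then a jump table indexed by the constructor number or by the block tag.

let name = function
  | A -> "A" | B -> "B" | C -> "C" | D -> "D"
  | E _ -> "E" | F _ -> "F" | G _ -> "G" | H _ -> "H"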


Polymorphic variant constructors are sparse

This leads to the first difference of polymorphic variants: their constructors don’t get a dense number range like 0, 1, 2, 3. Instead, their names are hashed. `Foo becomes 3,505,894 and `Bar becomes 3,303,859. This is a sparse range. You wouldn’t want a jump table here: between the entries for `Foo and `Bar, there would be over 200,000 unused indexes!
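
The hash in question is the compiler’s variant hash, essentially hash_variant from the compiler’s Btype module. It is small enough to reproduce here, and you can check that it maps "Foo" to 3,505,894 and "Bar" to 3,303,859:

let hash_variant s =
  let accu = ref 0 in
  for i = 0 to String.length s - 1 do
    accu := 223 * !accu + Char.code s.[i]
  done;
  (* reduce to 31 bits *)
  accu := !accu land (1 lsl 31 - 1);
  (* make it signed for 64-bit architectures *)
  if !accu > 0x3FFFFFFF then !accu - (1 lsl 31) else !accu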

Instead, matching polymorphic variants is compiled to the equivalent of a bunch of nested if-then statements. To keep the number of tests low, they are arranged as a binary search (a sketch follows the list below). This means:

  1. Pattern matching a polymorphic variant involves a number of jumps that is proportional to the log of the number of cases. We can expect matching to be slower for larger variant types. In contrast, we expect matching ordinary variants to have the same performance no matter the size.
  2. It’s still not clear which one is slower. On the one hand, doing multiple jumps during a binary search sounds bad. On the other hand, the jumps are to hardcoded labels. A single indirect, computed jump looked up from a jump table tends to be much slower.
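
Here is the promised sketch: a hand-written model of what matching four polymorphic variant constructors compiles to. The hash values are the real ones for these names; the shape (comparisons against hardcoded constants, arranged as a binary search) is the point, not the exact tree.

(* Models: function `Bar -> ... | `Baz -> ... | `Foo -> ... | `Qux -> ... *)
let dispatch (hash : int) =
  if hash < 3505894 then
    if hash = 3303859 then "Bar" else "Baz"   (* `Bar; `Baz = 3,303,867 *)
  else
    if hash = 3505894 then "Foo" else "Qux"   (* `Foo; `Qux = 4,054,260 *)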

So, I created a benchmark. Here are the results (EDIT: see likely more accurate results prompted by @spyder):

[chart: benchmark results for matching]


As you can see, polymorphic variants (rather, binary search) turned out to be a lot faster. The β€œcontrol” test performs all the overhead of testing, but doesn’t have a match expression. After subtracting the control from each measurement, it looks like polymorphic variants are up to 15 times faster than jump tables. The raw data is here.

However, this is highly dependent on the processor architecture and compiler version. There is no reason why ordinary variants can’t be compiled to a binary search.

Also, sometimes, the compiler can convert a jump table into a data lookup table (for example, on 4.05, if each case immediately returns a value). This is about as fast as the control, and still insensitive to the number of cases.
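
For illustration, a match of this shape is a candidate for that optimization, because every arm immediately returns a constant (a made-up example; whether the optimization actually fires depends on the compiler version):

(* Instead of jumping through a table of code addresses, the compiler
   can load the result directly from a table of values. *)
let code = function
  | A -> 10 | B -> 20 | C -> 30 | D -> 40
  | E _ -> 1 | F _ -> 2 | G _ -> 3 | H _ -> 4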


Polymorphic variants can use more memory

Allocating polymorphic variants takes longer, and the extra memory use brings the next garbage collection cycle sooner.

There are three cases here (an Obj-based check follows the list):

  • Polymorphic variant constructors without arguments. These are represented as single unboxed integers, the same as ordinary variants (though the values are quite different!). Memory usage is the same.

  • Constructors with one argument. These are always represented as three words:

    [ block header ] [     hash     ] [   argument   ]
    

    That’s one more word than an ordinary variant! The difference is that the constructor hash is stored in that extra middle word, instead of being stuffed into the tag field of the block header.

  • Constructors with two or more arguments. These are represented as two separate blocks. The first points to the second, and the second is a tuple of all the arguments:

    [ block header ] [     hash     ] [   pointer    ]
    [ block header ] [   argument   ] [   argument   ]
    

This is now four words of overhead, while the overhead of ordinary variants is only one word.
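
Here is the promised Obj-based check of these layouts (3,505,894 is again the hash of "Foo"; sizes exclude the block header):

let () =
  let one = Obj.repr (`Foo 10) in
  assert (Obj.size one = 2);                             (* hash + argument *)
  assert ((Obj.obj (Obj.field one 0) : int) = 3505894);  (* hash of `Foo *)
  let two = Obj.repr (`Foo (10, 20)) in
  assert (Obj.size two = 2);                     (* hash + pointer to tuple *)
  assert (Obj.size (Obj.field two 1) = 2)        (* the tuple of arguments *)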


I wrote another benchmark to look at the relative cost of allocating polymorphic variants:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Name                β”‚ Time/Run β”‚ mWd/Run β”‚ Percentage β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ control             β”‚   2.74ns β”‚         β”‚     38.73% β”‚
β”‚ ordinary, 1 arg     β”‚   3.81ns β”‚   2.00w β”‚     63.63% β”‚
β”‚ polymorphic, 1 arg  β”‚   4.50ns β”‚   3.00w β”‚     75.14% β”‚
β”‚ ordinary, 2 args    β”‚   4.34ns β”‚   3.00w β”‚     72.60% β”‚
β”‚ polymorphic, 2 args β”‚   5.98ns β”‚   6.00w β”‚    100.00% β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polymorphic variants are clearly slower here. In the two-argument case, allocation is twice as slow as for ordinary variants, after the control is subtracted. This is probably the worst showing for polymorphic variants, as two arguments is the case with the greatest amount of relative overhead.


What about BuckleScript?

BuckleScript emits switch for ordinary variants, and binary search for polymorphic variants. It should probably just emit switch for both, and let the JS engine figure out the best way to execute it. However, I haven’t evaluated the performance, so this is just a guess.


So, which one is better for Lwt?

If making the decision based on performance only, I don’t know :slight_smile:

Lwt is inherently used for the I/O-bound fragment of each program, so it probably doesn’t matter.

Any program is likely to have all sorts of costs not related to which kind of variants are being used. Those will probably dominate the few nanoseconds that can be saved by choosing one variant kind over another.

Concerning ordinary variants: when the source-code variant is actually a GADT, sufficiently precise typing can sometimes eliminate pattern matching completely. This can have yet another tiny performance effect.
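
A made-up sketch of that effect (not Lwt’s actual code): with a GADT, the type can rule out all constructors but one, so a match can have a single case, and the compiler has nothing left to test at runtime.

type _ typed =
  | Int : int -> int typed
  | Str : string -> string typed

(* Only Int can have type int typed, so this single-case match is
   exhaustive; no dispatch is needed. *)
let get_int : int typed -> int = function Int n -> n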

For Lwt, considerations other than variant performance ultimately control the choice of variant kind.
