Iβm answering with a pretty thorough look at how variants are implemented. Hopefully, itβs still accessible
The short version is that matching polymorphic variants is, surprisingly, faster. Allocating them is slower. But the differences are slight. They donβt matter to Lwt. They probably donβt matter to most other projects, either.
What does this have to do with GADTs? At runtime, GADTs are the same as ordinary variants. So, Iβll talk about ordinary variants instead. Iβll also assume a 64-bit platform, so words are 64 bits long.
Iβll first describe how ordinary variants are represented and pattern-matched. Then, Iβll describe the relevant differences of polymorphic variants, and measure them. To preview, those differences are:
- Polymorphic variants are represented as sparse integer ranges, instead of dense ones.
- Slightly more memory is allocated for polymorphic variants.
I did this in part for my own education. Itβs a recurring question, though, so I decided to write it up I put commented assembly and benchmark source code into a gist. There is also some BuckleScript output.
How ordinary variants are represented in memory
Letβs look at this variant type:
type ordinary =
| A | B | C | D
| E of int | F of int | G of int | H of int
OCaml separates its constructors into two lists: constructors without arguments (A
, B
, C
, D
), and constructors with arguments (E
, F
, G
, H
). The constructors in each list are numbered starting from zero. Names donβt matter.
A -> 0 B -> 1 C -> 2 D -> 3 (* no arguments *)
E -> 0 F -> 1 G -> 2 H -> 3 (* take arguments *)
Construtors without arguments, like D
, become unboxed integers, which are single machine words that fit into a register.
Aside about OCaml integers:
The lowest bit of an unboxed integer is set to 1. The actual value is stored in the remaining 63 upper bits. This is to distinguish integers from pointers, where the lowest bit is set to 0. The GC needs this. Itβs part of how it knows what is a pointer that it needs to follow.
So, the integer
3
is represented as(3 << 1) | 1
, which is7
.
With this aside in mind, the no-argument constructorsβ final machine representations are 1
, 3
, 5
, and 7
.
Constructors with arguments, like G 10
, are represented as pointers to memory blocks (so the lowest bit will be 0). Following the pointer leads to a block like this:
[ block header ] [ 21 ]
21
is the OCaml representation of 10
. The block header contains information like the size of the block. It also contains a few bits for a tag field. In this case, the tag will be 2
, the number for G
.
If the constructor has two arguments, there will be three words in the block:
[ block header ] [ 21 ] [ 85 ]
Matching ordinary variants
So, to perform a match
, the generated code has to:
- Determine whether it is looking at an unboxed integer (a constructor with no arguments), or a pointer (a constructor with arguments). This is done by examining the lowest bit of the word.
- If an unboxed integer, pick a case according to the high 63 bits.
- If a pointer, dereference, extract the constructor number from the tag field in the block header, and use that to pick the case.
Whatβs important is that in both (2) and (3), there is a dense integer range, from zero to the number of constructors of that kind. So, the match
case can be picked using a jump table: pointers to the code for each case are stored in an array. The constructor number is used as an index into that array. The processor jumps to the pointer stored at the proper index. This jump-by-computed pointer is called an indirect jump. Because of the random-access table lookup, the performance of the jump table is largely unaffected by the number of constructors.
Polymorphic variant constructors are sparse
This leads to the first difference of polymorphic variants: their constructors donβt get a dense number range like 0, 1, 2, 3. Instead, their names are hashed. `Foo
becomes 3,505,894 and `Bar
becomes 3,303,859. This is a sparse range. You wouldnβt want to have a jump table. Between the entries for `Foo`
and `Bar
, there would be over 200,000 unused indexes!
Instead, matching polymorphic variants is compiled to the equivalent of a bunch of nested if-then statements. To keep the number of tests low, they are arranged as a binary search. This means:
- Pattern matching a polymorphic variant involves a number of jumps that is proportional to the log of the number of cases. We can expect matching to be slower for larger variant types. In contrast, we expect matching ordinary variants to have the same performance no matter the size.
- Itβs still not clear which one is slower. On the one hand, doing multiple jumps during a binary search sounds bad. On the other hand, the jumps are to hardcoded labels. A single indirect, computed jump looked up from a jump table tends to be much slower.
So, I created a benchmark. Here are the results (EDIT: see likely more accurate results prompted by @spyder):
As you can see, polymorphic variants (rather, binary search) turned out to be a lot faster. The βcontrolβ test performs all the overhead of testing, but doesnβt have a match
expression. After subtracting the control from each measurement, it looks like polymorphic variants are up to 15 times faster than jump tables. The raw data is here.
However, this is highly dependent on the processor architecture and compiler version. There is no reason why ordinary variants canβt be compiled to a binary search.
Also, sometimes, the compiler can convert a jump table into a data lookup table (for example, on 4.05, if each case immediately returns a value). This is about as fast as the control, and still insensitive to the number of cases.
Polymorphic variants can use more memory
Allocating polymorphic variants takes longer, and using more memory brings the next garbage collection cycle closer in time.
There are three cases here:
-
Polymorphic variant constructors without arguments. These are represented as single unboxed integers, the same as ordinary variants (though the values are quite different!). Memory usage is the same.
-
Constructors with one argument. These are always represented as three words:
[ block header ] [ hash ] [ argument ]
Thatβs one more word than an ordinary variant! The difference is that the constructor hash is stored in that extra middle word, instead of being stuffed into the tag field of the block header.
-
Constructors with two or more arguments. These are represented as two separate blocks. The first points to the second, and the second is a tuple of all the arguments:
[ block header ] [ hash ] [ pointer ] [ block header ] [ argument ] [ argument ]
This is now four words of overhead, while overhead of ordinary variants is only one word.
I wrote another benchmark to look at the relative cost of allocating polymorphic variants:
βββββββββββββββββββββββ¬βββββββββββ¬ββββββββββ¬βββββββββββββ
β Name β Time/Run β mWd/Run β Percentage β
βββββββββββββββββββββββΌβββββββββββΌββββββββββΌβββββββββββββ€
β control, β 2.74ns β β 38.73% β
β ordinary, 1 arg β 3.81ns β 2.00w β 63.63% β
β polymorphic, 1 arg β 4.50ns β 3.00w β 75.14% β
β ordinary, 2 args β 4.34ns β 3.00w β 72.60% β
β polymorphic, 2 args β 5.98ns β 6.00w β 100.00% β
βββββββββββββββββββββββ΄βββββββββββ΄ββββββββββ΄βββββββββββββ
Polymorphic variants are clearly slower here. In the two-argument case, allocation is twice as slow as for ordinary variants, after the control is subtracted. This is probably the worst showing for polymorphic variants, as two arguments is the case with the greatest amount of relative overhead.
What about BuckleScript?
BuckleScript emits switch
for ordinary variants, and binary search for polymorphic variants. It should probably just emit switch
for both, and let the JS engine figure out the best way to execute it. However, I havenβt evaluated the performance, so this is just a guess.
So, which one is better for Lwt?
If making the decision based on performance only, I donβt know
Lwt is inherently used for the I/O-bound fragment of each program, so it probably doesnβt matter.
Any program is likely to have all sorts of costs not related to which kind of variants are being used. Those will probably dominate the few nanoseconds that can be saved by choosing one variant kind over another.
Concerning ordinary variants, when the actual source-code variant is a GADT, pattern matching might be eliminated completely through really precise typing. This can have yet another tiny performance effect.
For Lwt, considerations other than variant performance ultimately control the choice of variant kind.