Lwt core refactored and documented to be contributor-friendly

Sorry for the late edit. You’re a good, prolific writer, and I should have realized you were busy reacting to my comment and just gone ahead with a new post rather than editing.

Changing a phantom type parameter certainly seems more innocuous than re-interpreting the entire representation of a type, and since phantom types don’t inherently mean anything to the compiler, I can see it being one of the more justified uses of Obj.magic, so I guess I’m convinced.
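
For concreteness, here is a generic illustration of the kind of cast I mean (not Lwt’s actual code): only the phantom parameter changes, so the value has the same runtime representation on both sides.

type 'state promise = { mutable result : int option }
  (* 'state is a phantom parameter: it never appears on the right-hand side *)

let mark_resolved (p : [ `Pending ] promise) : [ `Resolved ] promise =
  Obj.magic p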

2 Likes

I look at this as someone who spent a good bit of time debugging segfaults when trying to upgrade the compiler, which turned out to be caused by uses of Obj.magic in a third-party library that used to be “ok” but the compiler had just gotten smart enough to catch the lie. From this point of view, I’m very wary of taking dependencies on libraries that use Obj.magic, and in cases where there are compelling reasons to use such libraries despite that, I find myself looking at every occurrence and trying to reconstruct the safety argument. Maybe I’m paranoid, but I figure that it’s a position that might be helpful for library authors to hear. That is, I think it would be good if whenever Obj.magic was used, the library author thought that every user of the library was going to read the code and reconstruct the safety argument.

With that said, I wonder if there would be significant value in some mechanism akin to Obj.magic that only allowed casting between types that the type checker can prove have the same representation. I’m not sure how feasible such a feature would be, but handling the particular case of magicking a phantom type does not seem too far out (although not expressible as a type of a value in the current system unless I’m confused).

1 Like

No worries about the post :slight_smile: I just wanted to be clear to future readers that I didn’t just arbitrarily ignore like 95% of what became the post :slight_smile:

I’m definitely paranoid about the Obj.magic myself, and it’s good to bring attention to it – especially if a side effect is getting close looks from Flambda experts.

This also made me think about our testing harder, and I realized that we probably don’t actually test on Flambda in CI at this point. Certainly, we don’t test enough. I’ve only tested manually in the past, and build rows aren’t actually doing it (AFAICT). I’ve opened an issue about that; thanks for the prod.

I agree with this in general.

For Lwt specifically, I can briefly describe the Obj.magics here. Hopefully they already are, or will be, well-documented in the new lwt.ml (that was the point of rewriting it!). To the best of my recollection, all the casts fall into one of three categories:

  1. Adjusting phantom type parameters. These are new.
  2. The types 'a Lwt.t and 'a Lwt.u actually have no definition inside lwt.ml or anywhere at all. What happens is that they are Obj.magicked to an internal promise type when passed to an API, and back the other way when returned from an API. This was the case in the old lwt.ml, and still is in the new. This is done because of that variance business I briefly mentioned, though there are other ways to address that.
  3. IIRC there is one place where lists of 'a Lwt.t are cast to lists of the internal promise type, without bothering to rebuild the lists, as would be done by calling map and applying the casts from (2) one element at a time. This is new.
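
As a rough sketch of (3), with a hypothetical stand-in for the internal promise type (internal below is not the real name), the whole-list cast avoids rebuilding the list, because the per-element cast from (2) is a no-op at runtime anyway:

type 'a internal = { mutable state : 'a option }  (* stand-in, not the real lwt.ml type *)

let to_internal (p : 'a Lwt.t) : 'a internal = Obj.magic p

(* Cast the whole list in one go... *)
let to_internal_list (l : 'a Lwt.t list) : 'a internal list = Obj.magic l

(* ...instead of rebuilding it element by element: *)
let to_internal_list_slow (l : 'a Lwt.t list) : 'a internal list =
  List.map to_internal l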

We’ll watch out for these casts becoming a problem or a burden, and get rid of them when the tradeoff is no longer worth it. Input on that tradeoff is very welcome.

Thanks for the explanations, that helps. But it is one thing to argue that, basically, values that are cast are always cast back to their true type, and another thing to argue that the compiler treats all the types any given value may have or be cast to in the same way. For instance, casts between types where mutability, laziness, being a float, etc. change are pretty difficult to be confident about I find, as it requires understanding the type-specific optimizations in the compiler and run time system.
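
To give one concrete example of a cast in that dangerous category (deliberately bad code, not something from Lwt): float arrays have a special unboxed layout, so a cast between int array and float array is not representation-safe, even though both are “just arrays” at the type level.

let xs : int array = [| 1; 2; 3 |]
let ys : float array = Obj.magic xs

(* Reads reinterpret the tagged integers as raw float bits, producing nonsense,
   and writes would store unboxed float words into a block that the GC still
   scans as ordinary values, which can crash the runtime. *)
let _ = ys.(0)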

2 Likes

I’ve just used the polymorphic variant type coercion in my application. Thanks for helping me to discover simple subtyping in OCaml!

2 Likes

Are there performance implications with using polymorphic variants?

1 Like

I’m answering with a pretty thorough look at how variants are implemented. Hopefully, it’s still accessible :slight_smile:

The short version is that matching polymorphic variants is, surprisingly, faster. Allocating them is slower. But the differences are slight. They don’t matter to Lwt. They probably don’t matter to most other projects, either.


What does this have to do with GADTs? At runtime, GADTs are the same as ordinary variants. So, I’ll talk about ordinary variants instead. I’ll also assume a 64-bit platform, so words are 64 bits long.


I’ll first describe how ordinary variants are represented and pattern-matched. Then, I’ll describe the relevant differences of polymorphic variants, and measure them. To preview, those differences are:

  1. Polymorphic variants are represented as sparse integer ranges, instead of dense ones.
  2. Slightly more memory is allocated for polymorphic variants.

I did this in part for my own education. It’s a recurring question, though, so I decided to write it up :slight_smile: I put commented assembly and benchmark source code into a gist. There is also some BuckleScript output.


How ordinary variants are represented in memory

Let’s look at this variant type:

type ordinary =
  | A | B | C | D
  | E of int | F of int | G of int | H of int

OCaml separates the constructors into two lists: constructors without arguments (A, B, C, D), and constructors with arguments (E, F, G, H). The constructors in each list are numbered starting from zero. Names don’t matter.

A -> 0   B -> 1   C -> 2   D -> 3   (* no arguments *)
E -> 0   F -> 1   G -> 2   H -> 3   (* take arguments *)

Constructors without arguments, like D, become unboxed integers: single machine words that fit into a register.

Aside about OCaml integers:

The lowest bit of an unboxed integer is set to 1. The actual value is stored in the remaining 63 upper bits. This is to distinguish integers from pointers, where the lowest bit is set to 0. The GC needs this. It’s part of how it knows what is a pointer that it needs to follow.

So, the integer 3 is represented as (3 << 1) | 1, which is 7.

With this aside in mind, the no-argument constructors’ final machine representations are 1, 3, 5, and 7.
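
You can observe this from the toplevel with the (unsafe) Obj module; with the type above loaded:

# Obj.is_int (Obj.repr D);;
- : bool = true
# (Obj.magic D : int);;
- : int = 3

The machine word holds 7; read back as an OCaml int, that is 3, the constructor’s number.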


Constructors with arguments, like G 10, are represented as pointers to memory blocks (so the lowest bit will be 0). Following the pointer leads to a block like this:

[ block header ] [      21      ]

21 is the OCaml representation of 10. The block header contains information like the size of the block. It also contains a few bits for a tag field. In this case, the tag will be 2, the number for G.

If the constructor has two arguments, there will be three words in the block:

[ block header ] [      21      ] [      85      ]
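
This, too, can be checked with Obj (note that Obj.size does not count the header word):

# Obj.is_int (Obj.repr (G 10));;
- : bool = false
# Obj.tag (Obj.repr (G 10));;
- : int = 2
# Obj.size (Obj.repr (G 10));;
- : int = 1
# (Obj.obj (Obj.field (Obj.repr (G 10)) 0) : int);;
- : int = 10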

Matching ordinary variants

So, to perform a match, the generated code has to:

  1. Determine whether it is looking at an unboxed integer (a constructor with no arguments), or a pointer (a constructor with arguments). This is done by examining the lowest bit of the word.
  2. If an unboxed integer, pick a case according to the high 63 bits.
  3. If a pointer, dereference, extract the constructor number from the tag field in the block header, and use that to pick the case.

What’s important is that in both (2) and (3), there is a dense integer range, from zero to the number of constructors of that kind. So, the match case can be picked using a jump table: pointers to the code for each case are stored in an array, the constructor number is used as an index into that array, and the processor jumps to the pointer stored at that index. A jump through a computed pointer like this is called an indirect jump. Because of the random-access table lookup, the performance of the jump table is largely unaffected by the number of constructors.


Polymorphic variant constructors are sparse

This leads to the first difference of polymorphic variants: their constructors don’t get a dense number range like 0, 1, 2, 3. Instead, their names are hashed. `Foo becomes 3,505,894 and `Bar becomes 3,303,859. This is a sparse range. You wouldn’t want to build a jump table over it: between the entries for `Foo and `Bar alone, there would be over 200,000 unused indexes!
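
The hash is computed entirely at compile time by a small function in the compiler (hash_variant); it is roughly this:

let hash_variant name =
  let accu = ref 0 in
  String.iter (fun c -> accu := 223 * !accu + Char.code c) name;
  (* reduce to 31 bits, and make large values negative so the result
     also fits in the native int of a 32-bit platform *)
  let h = !accu land (1 lsl 31 - 1) in
  if h > 0x3FFFFFFF then h - (1 lsl 31) else h

(* hash_variant "Foo" = 3_505_894, hash_variant "Bar" = 3_303_859 *)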

Instead, matching polymorphic variants is compiled to the equivalent of a bunch of nested if-then-else tests. To keep the number of tests low, they are arranged as a binary search over the hash values (a sketch follows the list below). This means:

  1. Pattern matching a polymorphic variant involves a number of jumps that is proportional to the log of the number of cases. We can expect matching to be slower for larger variant types. In contrast, we expect matching ordinary variants to have the same performance no matter the size.
  2. It’s still not clear which one is slower. On the one hand, doing multiple jumps during a binary search sounds bad. On the other hand, the jumps are to hardcoded labels. A single indirect, computed jump looked up from a jump table tends to be much slower.
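
Here is a hand-written sketch of the shape of that generated code – not actual compiler output. Say we match on [ `Bar | `Baz | `Foo | `Qux ] and return 0–3; the hashes for `Bar and `Foo are the ones quoted above, and I computed those for `Baz and `Qux with the same function:

(* h is the constructor hash of the value being matched *)
let dispatch (h : int) =
  if h >= 3_505_894 (* `Foo *) then
    (if h >= 4_054_260 (* `Qux *) then 3 else 2)
  else
    (if h >= 3_303_867 (* `Baz *) then 1 else 0)

Every test compares against a hardcoded constant and branches to a fixed label; there is no table lookup and no indirect jump.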

So, I created a benchmark. Here are the results (EDIT: see likely more accurate results prompted by @spyder):

[Graph: matching benchmark results – ordinary variants (jump table) vs. polymorphic variants (binary search), by number of cases]


As you can see, polymorphic variants (rather, binary search) turned out to be a lot faster. The “control” test performs all the overhead of testing, but doesn’t have a match expression. After subtracting the control from each measurement, it looks like polymorphic variants are up to 15 times faster than jump tables. The raw data is here.

However, this is highly dependent on the processor architecture and compiler version. There is no reason why ordinary variants can’t be compiled to a binary search.

Also, sometimes, the compiler can convert a jump table into a data lookup table (for example, on 4.05, if each case immediately returns a value). This is about as fast as the control, and still insensitive to the number of cases.


Polymorphic variants can use more memory

Allocating polymorphic variants takes longer, and using more memory brings the next garbage collection cycle closer in time.

There are three cases here (a quick Obj-based check follows the list):

  • Polymorphic variant constructors without arguments. These are represented as single unboxed integers, the same as ordinary variants (though the values are quite different!). Memory usage is the same.

  • Constructors with one argument. These are always represented as three words:

    [ block header ] [     hash     ] [   argument   ]
    

    That’s one more word than an ordinary variant! The difference is that the constructor hash is stored in that extra middle word, instead of being stuffed into the tag field of the block header.

  • Constructors with two or more arguments. These are represented as two separate blocks. The first points to the second, and the second is a tuple of all the arguments:

    [ block header ] [     hash     ] [   pointer    ]
    [ block header ] [   argument   ] [   argument   ]
    

    This is now four words of overhead, while overhead of ordinary variants is only one word.
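
For a quick check of these sizes, take a hypothetical type with a one-argument constructor and a two-argument constructor, type with_args = F of int | P of int * int. In the toplevel (Obj.size does not count the header word):

# Obj.size (Obj.repr (F 10));;
- : int = 1     (* ordinary, 1 arg: just the argument *)
# Obj.size (Obj.repr (`F 10));;
- : int = 2     (* polymorphic, 1 arg: hash + argument *)
# Obj.size (Obj.repr (P (1, 2)));;
- : int = 2     (* ordinary, 2 args: the two arguments *)
# Obj.size (Obj.repr (`P (1, 2)));;
- : int = 2     (* polymorphic, 2 args: hash + pointer to the arguments... *)
# Obj.size (Obj.field (Obj.repr (`P (1, 2))) 1);;
- : int = 2     (* ...which live in a separate two-field block *)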


I wrote another benchmark to look at the relative cost of allocating polymorphic variants:

┌─────────────────────┬──────────┬─────────┬────────────┐
│ Name                │ Time/Run │ mWd/Run │ Percentage │
├─────────────────────┼──────────┼─────────┼────────────┤
│ control,            │   2.74ns │         │     38.73% │
│ ordinary, 1 arg     │   3.81ns │   2.00w │     63.63% │
│ polymorphic, 1 arg  │   4.50ns │   3.00w │     75.14% │
│ ordinary, 2 args    │   4.34ns │   3.00w │     72.60% │
│ polymorphic, 2 args │   5.98ns │   6.00w │    100.00% │
└─────────────────────┴──────────┴─────────┴────────────┘

Polymorphic variants are clearly slower here. In the two-argument case, allocation is twice as slow as for ordinary variants, after the control is subtracted. This is probably the worst showing for polymorphic variants, as two arguments is the case with the greatest amount of relative overhead.
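
For reference, here is a dependency-free sketch of this kind of allocation microbenchmark. The real benchmark in the gist uses core_bench; the constructor names and the iteration count below are made up, and Gc.minor_words reports the words allocated on the minor heap:

type ord = One of int | Two of int * int

let time name f =
  let iterations = 50_000_000 in
  let words_before = Gc.minor_words () in
  let start = Sys.time () in
  for i = 1 to iterations do
    ignore (Sys.opaque_identity (f i))
  done;
  let seconds = Sys.time () -. start in
  let words = Gc.minor_words () -. words_before in
  Printf.printf "%-22s %6.2f ns/run  %4.1f words/run\n"
    name
    (seconds /. float_of_int iterations *. 1e9)
    (words /. float_of_int iterations)

let () =
  time "ordinary, 1 arg" (fun i -> One i);
  time "polymorphic, 1 arg" (fun i -> `One i);
  time "ordinary, 2 args" (fun i -> Two (i, i));
  time "polymorphic, 2 args" (fun i -> `Two (i, i))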


What about BuckleScript?

BuckleScript emits switch for ordinary variants, and binary search for polymorphic variants. It should probably just emit switch for both, and let the JS engine figure out the best way to execute it. However, I haven’t evaluated the performance, so this is just a guess.


So, which one is better for Lwt?

If making the decision based on performance only, I don’t know :slight_smile:

Lwt is inherently used for the I/O-bound fragment of each program, so it probably doesn’t matter.

Any program is likely to have all sorts of costs not related to which kind of variants are being used. Those will probably dominate the few nanoseconds that can be saved by choosing one variant kind over another.

Concerning ordinary variants, when the actual source-code variant is a GADT, pattern matching might be eliminated completely through really precise typing. This can have yet another tiny performance effect.
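
A tiny illustration of that point (whether the check is actually dropped depends on the compiler version and flags): when the GADT index pins down the only possible constructor, the match below has a single reachable case, so no tag test is needed.

type _ value =
  | Int : int -> int value
  | Str : string -> string value

(* [Str _] cannot have type [int value], so this match is exhaustive with one case. *)
let get_int (v : int value) =
  match v with
  | Int n -> n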

For Lwt, considerations other than variant performance ultimately control the choice of variant kind.

14 Likes

Section 4 of the paper Programming with Polymorphic Variants also discusses the pattern-matching performance of this representation:

Another advantage of this representation is that it allows for an efficient compilation of matching. Admittedly, we cannot use table switch, like with defined datatypes, so matching a variant is not constant time. But since we know the hash value for each tag at compile time, we can generate a choice tree, and do the matching in log(n). Moreover, on many computer architectures conditional jump is much faster than indirect jump, so that for modest size variants types, we may even be faster than a table switch. Compiling to native code on a DEC Alpha, with the Objective Caml 1.07 backend, we noticed a more than 10 % speedup on ten way cases (the benchmark is a single flat matching inside a for loop). Of course, this speedup does not apply to bytecode.

5 Likes

Very interesting - I’d always wondered about this. How does it deal with hash collisions though?

1 Like

You get a compile time error.

5 Likes

Indeed. To clarify, since this also came up in another discussion, the hashes are evaluated entirely at compile time, only to generate the constructor representations. The compiler checks that there are no collisions. I should have put this in the actual write-up.

1 Like

Wow! If this is the case, it points to a performance issue with normal variants that should be addressed. If the compilation strategy is really this much slower, then it’s the wrong one. You should post an issue about it.

1 Like

Andrew Herron (@TheSpyder) suggested on Twitter that modern branch predictors are pretty smart, and it looks like he may be right. I replaced the sequential case access pattern with running a PRNG inside the pattern-matching loop, and got results that look like this instead:

[Graph: matching benchmark results with a randomized access pattern]

Again, this is hardware architecture-specific.

I thought the sequential access pattern might be skewing the results, because adjacent variants trigger descent down nearly the same path in the binary search each time. So, if the branch predictor, say, caches the results of a few recent jumps and predicts they will go the same way again the next time, it will predict correctly most of the time.

To check this, I also tried randomizing the array that controls the access pattern once, during initialization, using a PRNG, instead of generating random indexes inside the loop. This produced the same results as in the original post, though, strongly favoring polymorphic variants on my machine. So, I’m not sure what the branch predictor is doing, or whether it is even the branch predictor that is responsible for this performance. It may just be that an array access is easier for some other kind of speculative execution to handle than whatever the PRNG (module Random) is doing.

Anyway, polymorphic variants still seem to do pretty well. Just not better by an absurd and outrageous margin :slight_smile:

2 Likes

Hmm. It’s probably some combination of branch prediction and CPU caching. Any time code accesses memory that the predictor has loaded into cache already is a win; fetching from ram is quite slow by comparison.

Sequential memory access or small blocks of localised access that fit within a cache line (arrays) are much faster on modern CPUs than theoretically better algorithms that access memory randomly (pointers and jump tables). Micro benchmarks are hard :sweat_smile:

Your new graph looks about right to me though, with roughly linear increase in performance vs size for the polymorphic variant.

1 Like

I don’t think this is a code (or data) caching or sequential-access issue, though.

The amount of code generated even for the 128-case matches should easily fit into any modern code cache, and this cache is quickly warmed over the time span each test is run (I’ve tried 1 to 10 seconds, so we are talking about tens of millions to billions of repetitions). Data caches are even larger, and the access pattern array fits in those quite easily.

Concerning sequential access, sequential access of the array that contains the access pattern (when not using the PRNG) is factored into the control. Sequential access of the code itself should be mainly a non-issue due to warm caches, and it also can’t explain the good performance observed with a randomized array of variants (see the last big paragraph of the PRNG reply). I didn’t include the graph for that, because it’s basically identical to the one in the original post – a big advantage for binary search.

I do agree that the second graph is more likely to reflect realistic scenarios, though. There is something "special" happening with choosing the variants from an array. I don’t know if it’s due to the amount of code taken to do it, or something else that my CPU is able to optimize. I don’t think it’s due to the sequence of variants or the number of them, though.

1 Like

This is all really interesting.

Still seems to me like we should be changing the compilation strategy for normal variants based on this (assuming more benchmarks show the same trend). If modern CPUs are so good with branch prediction - and it seems like they are - we should be using branches rather than dispatch arrays. Branch prediction will also presumably keep improving.

If it’s easy for me to clone these tests, I would like to run them on ARM.

1 Like

Should be easy: see the benchmark files in the linked gist. Let me know if something is missing.

Having trouble getting the compiler to work on ARM.