Polymorphic variant representation


I am trying to understand the difference between standard and polymorphic variants in term of memory representation and code optimization.
So far, what I understood is that:

  • standard variants:
    • non parametrized variants are represented as integer, starting from 0 (0x1) with incrementing tag;
    • parametrized variants are represented as memory blocs with a tag stored in the header – as a byte, tags start at 0 and range up to 255 (with some reserved values at the end).
  • polymorphic variants:
    • non parametrized variants are represented as integer – it is the hash of its name;
    • parametrized variants are represented as memory blocs with tag 0 and their first field is the hash of their name.

I can think about two drawbacks for using polymorphic variants:

  • parametrized ones use one more word in memory;
  • unlike the tag that is bound to small numbers, the hash can range from 0 to 2^31+ – some optimization like pattern matching with jump table can not be used, forcing to use a sequence of if-then-else.
    Thus, polymorphic variants can end up producing slightly slower running code.

Am I forgetting something or is it all?

Also, I am wondering, as polymorphic variants use hash function, is a collision possible for two distinct variants? Theoretically, it is but I am expecting it to be very (very…) rare events.
Is there, in practice, some ̀hash` properties by design, that there is no collision for any string of size lower than X where X is the longest name writable for a variant?


This is not really in scope for your question, but another disadvantage of polymorphic variants is that their typing rules are more complicated.

There’s an advantage to polymorphic variants that you have not mentioned: you can pass their values from two different types without the cost of a conversion (if the types are compatible).

I don’t know the exact properties of the hash function, but I believe there is a compile-time check that no two variants with the same hashes are used in the same compilation unit (it checks on all variant constructors appearing in types, so even if you have a collision, you will never be able to build a program manipulating both values together). It’s possible that you could observe a collision with the polymorphic comparison on values whose type has been existentially quantified, but that’s not specific to polymorphic variants (you can check that 0 has the same representation as false the same way).


$ ocaml
        OCaml version 4.10.0

# [`squigglier; `premiums];;
Line 1, characters 14-23:
1 | [`squigglier; `premiums];;
Error: Variant tags `squigglier and `premiums have the same hash value.
       Change one of them.

It is true, but they allow more flexibility and I expect them to be easier to learn, from a beginner point of view, than GADT.

You are right. It is exactly why I am looking for polymorphic variant. In this different topic Sum types: sub- & super-type, I present two use cases where polymorphic variants perform well but, are maybe too powerful for what I want to do.

Oh… so it is possible, but is caught by the compilation.
I am curious, is it an already known collision example, or did you find it just for me? :slight_smile:

It’s mentioned in https://dl.acm.org/doi/10.1145/1292535.1292548 (2007)
See also https://www.google.com/search?q=“squigglier”+ocaml