Interesting article about OCaml integer arithmetic

This article raises an interesting point: Language choices, do they matter? Also, whats this ocaml? | Simple Automation

If the bit is set, OCaml considers the value an integer. If not, it thinks the value is a pointer to a larger object. This is a neat trick used in lots of places. The V8 engine in Chrome famously did this for numbers as well.

However, it does seem silly to do when the type system knows that the type is a simple integer.

It seems that the information that the value is an integer, is lost after type-checking, and is not used to simplify the generated Assembly?

The GC uses that bit to know whether a value is an immediate value or a pointer to a heap allocated block. The GC just needs some discriminant to be there to know how to treat a value, which is why it’s there at runtime despite the type checking. More information is here: Memory Representation of Values - Real World OCaml. Other approaches to this are to maintain this info in a separate data structure (I’ve seen it called a “stack map”), or to use conservative GC (where values that “look like” pointers are assumed to be pointers).

In dynamically typed language runtimes (like V8 mentioned in your quote), the trick used is usually that every value is treated as a double, and the NaN bits are used as the metadata (sometimes called NaN-tagging or NaN-boxing). There’s a pretty good description here.

In at least some JVMs, everything – everything – comes with type-maps. Of course, every object in the heap has a class, and those have type-maps that tell you which slots are primitive values and which are pointers. Every stack-frame either has a type-map, or one can be generated (e.g. the Solaris JVM used to have a cache of 'em – when that cache filled-up, you’d get inexplicable slowdowns[1] as the VM had to recompute type-maps, only to have them flushed b/c the cache was too full). Then the equivalent of the OCaml runtime requirement that “you can only suspend execution at ‘safe points’ where all values are ‘in representation’”, becomes “you can only suspend execution at ‘safe points’ where it is known how to compute the type-map of the current stack-frame”.

It’s been a long time since I worked in JVMs, but this was the state of play in the mid-00s.

ETA: oof, I forgot the terminology! When it’s a stack-frame, it’s called a “stack-map”, just as @nated says.
[1] gosh, how could I know this? Heh, had to debug that problem onsite at a customer; thank goodness my partner-in-crime had the source to the Solaris JVM, and we could build a custom version with a double-sized stack-map cache.

1 Like

The article overlooks the fact the fact that most of the time the typechecker does not know that a value is an integer, due to polymorphism and abstract types.

Locally, the compiler will perform some untagging optimizations, as you can observe if you change the function in the article to

let square x = x * x * x * x

which produces assembly that looks like:

camlExample__square_5:
        movq    %rax, %rbx
        sarq    $1, %rbx
        movq    %rbx, %rdi
        decq    %rax
        imulq   %rdi, %rax
        imulq   %rbx, %rax
        imulq   %rbx, %rax
        incq    %rax
        ret

As you can see, the tag is only added/removed once, not three times. However, the program must revert to the general representation at function boundaries in order to be able to compose with the rest of the program.

Finally, while maintaining the tag bit on integers carries some cost, this cost is rather small (a few percentage points for integer-heavy computations, see section 5 of https://xavierleroy.org/publi/unboxing-tic97.pdf).

Cheers,
Nicolas

5 Likes