Representing data more compactly but unsafely

I’m collecting cases where people changed a data representation either to store values more compactly or to store extra data in unused bits. Examples that rely on unsafe features of OCaml (or a similar language runtime) are fine. I will list a few examples that I found below, and I hope that you can add some more.

  1. In a DSL for relational programming we have values in which every part can be replaced by a logic variable. For example, instead of the ground list type 'a list = [] | (::) of 'a * 'a list, we write everywhere types like:

    type 'a logic =
      | Value of 'a
      | Var of int     (* simplified *)
    type ('a, 'b) l = [] | (::) of 'a * 'b
    type 'a logic_list = ('a, 'a logic_list) l logic
    

    After that we reuse the OCaml value representation as a tree and perform unification on these OCaml trees (this involves working with the Obj module).

    This representation has a performance drawback: all ground values are tagged with the constructor Value, and trees become twice as large if we count pointers. To avoid the tagging we decided to have an abstract type for logic values with a more compact representation: we store logic variables as before, but we don’t wrap ground values in Value constructors. After unification and substitution these unsafe values can be reified back into the typed representation above ('a logic_list, for example).
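A minimal sketch of how such an untagged representation might work (the variable tag, the detection scheme, and all names are made up for illustration; this is not the actual library code):

```ocaml
(* Sketch: ground values are stored as-is with no Value wrapper; only
   logic variables live in a block with a distinctive tag that we
   recognize at runtime through the Obj module. *)
type 'a ilogic                        (* abstract: a plain 'a or a variable *)

let var_tag = 42                      (* assumed not to clash with user data *)

let inj : 'a -> 'a ilogic = Obj.magic (* no Value constructor allocated *)

let fresh_var (idx : int) : 'a ilogic =
  let b = Obj.new_block var_tag 1 in
  Obj.set_field b 0 (Obj.repr idx);
  Obj.magic b

let is_var (x : 'a ilogic) : bool =
  let r = Obj.repr x in
  Obj.is_block r && Obj.tag r = var_tag

(* Reification recovers the safe, tagged representation on demand. *)
type 'a logic = Value of 'a | Var of int

let reify (x : 'a ilogic) : 'a logic =
  if is_var x then Var (Obj.obj (Obj.field (Obj.repr x) 0))
  else Value (Obj.magic x)
```

The unsafety is visible in is_var: a ground user value that happens to be a block with the same tag would be mistaken for a variable, so a real implementation must choose the representation of variables more carefully.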

  2. Another example is from the paper Full reduction at full throttle. There we have a tagged data type that is either a closure or a pair:

    type head =
      | Lam of (t -> t)
      | Accu of atom * t list
    

    To avoid the tagging, the authors decided to store values without constructors and to discriminate the two cases by the runtime tags of the underlying blocks: closures always have the tag Closure_tag, and pairs have tag 0.
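A hedged sketch of this kind of untagged discrimination with the Obj module (not the paper’s actual code; the helper names are mine):

```ocaml
(* Sketch: a value is stored either as a bare closure or as a bare pair,
   and the two cases are told apart by the runtime tag of the block:
   closures carry Obj.closure_tag, while tuples carry tag 0. *)
type atom = string                    (* stand-in for the real atom type *)
type t = Obj.t                        (* untagged: closure or pair *)

let lam (f : Obj.t -> Obj.t) : t = Obj.repr f
let accu (a : atom) (args : Obj.t list) : t = Obj.repr (a, args)

let describe (v : t) : string =
  if Obj.tag v = Obj.closure_tag then "closure"
  else "accumulator"                  (* tag 0: an (atom, t list) pair *)
```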

  3. The previous two cases were about storing data in a more compact form. Here is an example of how a language runtime can use tagged pointers to store extra information.

    Today pointers in our operating systems are word-aligned (to 8 bytes on 64-bit platforms), so the three lowest bits are usually zero. We can reuse those bits to store some type information. The OCaml runtime uses only the lowest bit, to distinguish immediate values. Chez Scheme (a BIBOP runtime), AFAIK, uses more bits, for example to distinguish pairs from other values. As a consequence, a pair of ints in Chez Scheme occupies only two words, while in OCaml it occupies three. I think this knowledge is widespread.

    What is less widespread (meaning I was surprised when I read it) is that we can reuse a few bits in the middle of the word if we decide to have double-precision floats as immediate values. A 64-bit float is a NaN when its 11 exponent bits are all ones and its mantissa is nonzero, so setting the top 12 bits of the word marks it as a NaN and leaves 52 payload bits for pointers. Since current hardware uses at most 48 bits of virtual address space, 48 bits are enough for a pointer, so we have 4 extra bits for type information in the middle of the word.

    I’m curious what consequences would appear if we redesigned the OCaml runtime using BIBOP. But that’s probably a topic for another thread.
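To make the NaN-boxing layout above concrete, here is a hedged sketch using Int64 arithmetic (the constants and helper names are my own, following the split described above: top 12 bits mark the NaN space, the next 4 bits hold a type tag, the low 48 bits hold the pointer or payload; a real scheme must also avoid colliding with genuine NaN float values):

```ocaml
(* Sketch of NaN-boxing bit manipulation; not a real runtime's layout. *)
let nan_mask = 0xFFF0_0000_0000_0000L   (* top 12 bits all set *)
let tag_shift = 48
let payload_mask = 0x0000_FFFF_FFFF_FFFFL

(* Box a 48-bit payload together with a 4-bit tag inside the NaN space. *)
let box ~tag payload =
  Int64.(logor nan_mask
           (logor (shift_left (of_int tag) tag_shift)
              (logand payload payload_mask)))

let is_boxed v = Int64.(equal (logand v nan_mask) nan_mask)
let tag_of v = Int64.(to_int (logand (shift_right_logical v tag_shift) 0xFL))
let payload_of v = Int64.logand v payload_mask
```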

Not sure if this is exactly what you were looking for, but @Yaron_Minsky wrote about using GADTs to achieve a safe, compact array representation: Jane Street Tech Blog - Why GADTs matter for performance

As he noted there,

GADTs are about more than clever typed interpreters; they’re a powerful mechanism for building easy to use abstractions that give you more precise control of your memory representation.

Also not sure if it fits what you have in mind, but some time ago I wrote some code for a somewhat optimized representation of source file locations in Dolmen: dolmen/loc.ml at master · Gbury/dolmen · GitHub

One part of the compactness actually comes from an algorithmic improvement: instead of having a record of 4 integers (start line, start column, end line, end column), we store 2 integers representing the start and end byte offsets of the location, and use a table storing the byte offset of each line start to translate those two integers back into a full location. One can then pack these two integers into a single one (using regular, somewhat safe arithmetic shifts). However, this limits the range of representable locations (particularly on 32-bit archs), so to be complete, we also use the Obj module to fall back to an actual record with two separate fields when the two integers cannot be safely packed into one.

So in that sense the use of unsafe features is really minimal, but it basically allowed going from a 4-field record, which uses a total of 6 words per location (pointer + block header + 4 fields), to using (most of the time) a single word (an int) per location. With a big enough AST where each node has a location attached, this resulted in about 1/3 less memory used when parsing big files.
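A hedged sketch of the packing-with-fallback idea (not Dolmen’s actual code; the field names, the bit split, and a 64-bit platform are assumptions):

```ocaml
(* Sketch: store both offsets in one immediate int when they fit,
   otherwise fall back to a boxed record; Obj.is_int tells the two cases
   apart at runtime. Assumes 63-bit OCaml ints (64-bit platform). *)
type t                                (* abstract: packed int or record *)
type full = { start : int; stop : int }

let width = 31
let max_small = (1 lsl width) - 1

let make start stop : t =
  if start <= max_small && stop <= max_small
  then Obj.magic ((start lsl width) lor stop)   (* immediate int *)
  else Obj.magic { start; stop }                (* boxed record *)

let view (loc : t) : int * int =
  let r = Obj.repr loc in
  if Obj.is_int r then
    let p : int = Obj.obj r in
    (p lsr width, p land max_small)
  else
    let f : full = Obj.obj r in
    (f.start, f.stop)
```

The packed case costs one word instead of six, and the record fallback keeps arbitrary offsets representable.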

It would be really nice if we could add an annotation that optimized the data layout for us given all options listed in the type. Something like [@optimize_variants]. It would really give an advantage to regular variants over their polymorphic cousins.

Regarding NaN-boxing, it really is cool. I assume the core team wouldn’t consider it due to the changes required to the language. Though it would dramatically improve the speed of float manipulation, it would impose maximum limits on memory size, cost a little bit of integer speed, take away the 63-bit integers we currently have, and make the runtime radically different for 32-bit and 64-bit platforms.

@yawaramin I have seen that link, but in this thread I expect something more unsafe and hackish. Thanks for the reminder, though.

@zozozo, Thanks, it is exactly the stuff I wanted to find!

The Zarith wrapper around GMP declares a type t that could look like

type t =
  | SmallEnough of int
  | TooLarge of mpz_wrapper

But neither of these two constructors actually exists. The code distinguishes between them using the least significant bit, which is naturally different for values of type int and values of type mpz_wrapper.
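A hedged sketch of this kind of discrimination (not Zarith’s actual code; mpz_wrapper is replaced by a stand-in boxed type, and the helper names are mine):

```ocaml
(* Sketch: a value of the abstract type t is either an immediate int or
   a pointer to a boxed value, and the two cases are told apart by the
   lowest bit of the word, which Obj.is_int inspects for us. *)
type mpz_wrapper = string             (* stand-in for the real GMP wrapper *)
type t = Obj.t

let of_small (n : int) : t = Obj.repr n
let of_large (z : mpz_wrapper) : t = Obj.repr z

let describe (x : t) : string =
  if Obj.is_int x then "small: " ^ string_of_int (Obj.obj x)
  else "large: " ^ (Obj.obj x : mpz_wrapper)
```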

Thanks, silene. But your example seems to be the same as my example number 2, about full throttle.

The old mldonkey used Obj.set_tag to simulate inline records; it was written before inline records were introduced in OCaml. You might find more examples in my audit of uses of Obj.set_tag before it was deprecated: other/async_audit/set_tag · master · gadmm / stdlib-experiment · GitLab.

The paper you cite (Full reduction at full throttle) is actually how Coq’s native_compute is still implemented. It implements normalisation-by-evaluation via native OCaml compilation and execution. It is even worse than you describe: both Lam and Accu are closures, so that they can be applied uniformly without looking at the tag (this is better explained in the paper). This causes some issues for migrating to the new representation of closures in no-naked-pointers mode.

The case mentioned by @silene is more reasonable, and there is work by Chataing and Scherer at the ML workshop to make it possible to express it one day in OCaml: https://cambium.inria.fr/~scherer/research/constructor-unboxing/constructor-unboxing-ml-workshop-2021.pdf.

It might be useful to know that the two least significant bits of pointers are already reserved, not just one. The second bit is used in the runtime to express raising exceptions from C functions by returning an encoded exception (detected with a check like (e & 0b11) == 0b10). Both bits are also used to store information during compaction.

Lastly, storing information in pointers requires masking this information off before dereferencing, which is costly in terms of code size and might affect performance (with the possible exception of reusing the non-significant most significant bits on ARM, but I am not sure of the details).

Do you know about separate tagging in OCaml? That is, avoiding boxing by keeping a tag and its attached payload in separate values. I will show a very simple example, but it is surprising what you can achieve with it.

Here is an example of the different ways of programming a function that takes either a float or an int and returns the closest int.

type my_adt = TI of int | TF of float

(* Safe, but uses unnecessary boxing. *)
let f_safe = function TI i -> i | TF f -> int_of_float f

type my_tags = TI | TF

(* Very unsafe, but no useless boxing. *)
let f_unsafe tag n =
  match tag with
  | TI -> (Obj.magic n : int)
  | TF -> int_of_float (Obj.magic n : float)

type _ my_gadt = TI : int my_gadt | TF : float my_gadt

(* Same runtime behaviour as f_unsafe, but safe. *)
let f_gadt : type t. t my_gadt -> t -> int =
 fun tag n ->
  match tag with
  | TI -> n
  | TF -> int_of_float n

To be clearer, the main issue is that there is currently no way to efficiently test whether a block has the closure tag. (Coq does not need to distinguish between accumulators and closures in the fast path.) That is why Coq accumulators are stored with tag zero, as it can be efficiently tested. If the OCaml compiler knew how to generate efficient code for Obj.tag x = Obj.closure_tag, Coq would not use tag zero for accumulators. Using the proper tag for accumulators (i.e., the closure tag) makes the corresponding tests of FiatCrypto five times slower, even when using custom C code instead of Obj.tag.

If that’s the only reason for this abominable accumulator hack, I wish I had known about it earlier… It should be very easy for the compiler to expose a function Obj.is_closure that is very fast. If the Coq team is interested in having that function, could you open a feature request on the OCaml repo?

I also wish I understood that earlier! I think that beyond a fast Obj.is_closure, they are interested in the optimised compilation of pattern-matching. There is some discussion there: Do not rely on accumulators having tag 0 in native compilation. by ppedrot · Pull Request #14048 · coq/coq · GitHub.
If I understand correctly, they want something like:

type t =
  | Closure of (u -> u) [@unboxed]
  | Whatever of v

Is that correct?
Implementing this would also justify the related dirty trick in camlinternalOO.ml mentioned in the above PR. This might fall within the scope of @gasche’s project.

Yes. In fact, I did explain this exact use case in the corresponding RFC: https://github.com/ocaml/RFCs/pull/14#issuecomment-674446467

That is right. Currently, due to the way pattern-matching is implemented, testing for a closure with tag zero is basically free:

match foo with
| ClosureWithTag0 _ -> ...
| ActualConstructor1 x -> ...
| ActualConstructor2 x -> ...

Obj.is_closure would be suitable only if the following pattern matching were (almost) as fast:

match foo with
| x when Obj.is_closure x -> ...
| ActualConstructor1 x -> ...
| ActualConstructor2 x -> ...

Heh, there now seems to be a prototype by @nchataing announced right this morning in the RFC you just linked.
